Vast Challenge 2021 Assignment

Welcome to Ong Chee Hong’s Vast Challenge 2021 Assignment

Ong Chee Hong https://www.linkedin.com/in/alexongch/
07-04-2021

Vast Challenge 2021: Mini-Challenge 2

Background

Many of the Abila, Kronos-based employees of GAStech have company cars which are approved for both personal and business use. Those who do not have company cars have the ability to check out company trucks for business use, but these trucks cannot be used for personal business.

Employees with company cars are happy to have these vehicles, because the company cars are generally much higher quality than the cars they would be able to afford otherwise. However, GAStech does not trust their employees. Without the employees' knowledge, GAStech has installed geospatial tracking software in the company vehicles. The vehicles are tracked periodically as long as they are moving.

This vehicle tracking data has been made available to law enforcement to support their investigation. Unfortunately, data is not available for the day the GAStech employees went missing. Data is only available for the two weeks prior to the disappearance.

To promote local businesses, Kronos based companies provide a Kronos Kares benefit card to GASTech employees giving them discounts and rewards in exchange for collecting information about their credit card purchases and preferences as recorded on loyalty cards. This data has been made available to investigators in the hopes that it can help resolve the situation. However, Kronos Kares does not collect personal information beyond purchases.

As a visual analytics expert assisting law enforcement, your mission is to identify which GASTech employees made which purchases and identify suspicious patterns of behavior. You must cope with uncertainties that result from missing, conflicting, and imperfect data to make recommendations for further investigation.

Questions

  1. Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies? Please limit your answer to 8 images and 300 words.

  2. Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.

  3. Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.

  4. Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.

  5. Do you see evidence of suspicious activity? Identify 1-10 locations where you believe the suspicious activity is occurring, and why. Please limit your response to 10 images and 500 words.

Scope of work

To address the questions above, the scope of work is as follows:

  1. Literature Review

  2. Data preparation

  3. Methodology and visualisation

  4. Answers

  5. Acknowledgment

1. Literature Review

This year's Vast Challenge reuses the questions from the 2014 Vast Challenge. As such, we will conduct a literature review of past 2014 Vast Challenge submissions to MC2 - Patterns of Life Analysis. We will select some of the past submissions to identify gaps that could potentially be addressed with interactive data visualisation techniques to improve the user experience. The repository of 2014 Vast Challenge - MC2: Patterns of Life Analysis submissions can be found here

1.1 University of Buenos Aires - Arcaya

The localisation of different people using a bar chart is difficult to visualise: there are too many people, each indicated by a different colour, and too many locations, which squeezes the bars together. A recommendation is to split the locations into 3-4 groups so that the chart is easier to read.

1.2 Shandong University

Using a density plot for the price vs consumption area is not a good idea, as it is confusing to tell which employee is which when too many employees are squeezed into one chart. There is also no legend or tooltip to identify the employees.

1.3. ASTRI

The bar chart used to indicate less frequented locations is too cramped, especially the portions where many departments are squeezed into one location. This makes it difficult to analyse. A recommendation is to group the data into fewer locations and possibly provide some interactivity so that the user can click the legend to select the department they want.

2. Data preparation

2.1 Customisation of code chunks

First, we will customise all the code chunks using knitr chunk options. More information on chunk options can be found here
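The setup chunk itself is not reproduced in this section; as an illustrative sketch, such a chunk typically looks like the following (the specific option values are assumptions, not necessarily the ones used here):

```r
# Illustrative knitr setup chunk: show code, suppress noisy output
knitr::opts_chunk$set(
  echo = TRUE,       # display the code in the rendered document
  eval = TRUE,       # evaluate the chunks
  message = FALSE,   # hide package start-up messages
  warning = FALSE,   # hide warnings
  fig.retina = 3     # sharper figures on high-DPI screens
)
```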

2.2 Installing required R packages

Next, we will install the required R packages. There are five basic groups of packages that we will install:

  1. Data manipulation and preparation - The tidyverse package is a collection of R packages, including dplyr and tidyr, that assists users in manipulating data.

  2. Date and time - The two packages clock and lubridate are used for the manipulation of date and time data.

  3. Interactive data analysis - The two packages ggiraph and plotly are used to output data as interactive graphics/charts for analysis.

  4. Geovisual analysis - The packages raster, sf, tmap and rgdal are used for geospatial visual analytics, where data are output to a map to analyse, for example, the movement of people.

  5. Network analysis - Lastly, to analyse relationships between people, we will use the packages tidygraph, igraph, ggraph and visNetwork. visNetwork outputs interactive network visualisations.

# Note: dplyr, tidyr and forcats ship with tidyverse, so they are not listed separately
packages = c('DT','ggiraph','plotly','tidyverse','raster','sf','clock','tmap',
             'rgdal','textclean','jpeg','tiff','mapview','tidygraph','igraph',
             'ggraph','visNetwork','leaflet','lubridate')
for(p in packages){
  if(!require(p,character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

2.3. Provided data and information

There are 3 different types of data & information provided: a geospatial dataset, CSV files, and a JPG file of the Abila tourist map.

Below is the information of all the data provided.

2.3.1. A CSV file of vehicle assignments to employees (car-assignments.csv)

  1. Employee Last Name

  2. Employee First Name

  3. Car ID (integer)

  4. Current Employment Type (Department; categorical)

  5. Current Employment Title (job title; categorical)

2.3.2. A CSV file of vehicle tracking data (gps.csv)

  1. Timestamp

  2. Car ID (integer)

  3. Latitude

  4. Longitude

2.3.3. A CSV file containing loyalty card transaction data (loyalty_data.csv)

  1. Timestamp

  2. Location (name of the business)

  3. Price (real)

  4. Loyalty Number (a 5-character code starting with L that is unique for each card)

2.3.4. A CSV file containing credit and debit card transaction data (cc_data.csv)

  1. Timestamp

  2. Location (name of the business)

  3. Price (real)

  4. Last 4 digits of the credit or debit card number

2.3.5. ESRI shapefiles of Abila (in the Geospatial folder)

2.3.6. A tourist map of Abila with locations of interest identified, in JPEG format (MC2-Tourist.jpg)

2.4. Importing of data

We will import the 4 different CSV datasets that were provided.

car <- read_csv("data/mc2/car-assignments.csv")
cc <- read_csv("data/mc2/cc_data.csv")
gps <- read_csv("data/mc2/gps.csv")
loyalty <- read_csv("data/mc2/loyalty_data.csv")

If we take a look at the above datasets in Excel, we will see that there are foreign characters in some of them. An example is Katerina's Cafe, as shown below. To address this, we will need to specify the encoding so that RStudio reads the datasets properly.

To find out the encoding type of both the cc and loyalty datasets, guess_encoding will be used to detect the encoding, as shown below.

guess_encoding(cc)
# A tibble: 1 x 2
  encoding confidence
  <chr>         <dbl>
1 ASCII             1
guess_encoding(loyalty)
# A tibble: 1 x 2
  encoding confidence
  <chr>         <dbl>
1 ASCII             1

Next, we will add locale = locale(encoding = "ASCII") when reading both the cc and loyalty datasets.

car <- read_csv("data/mc2/car-assignments.csv")
cc <- read_csv("data/mc2/cc_data.csv", locale = locale(encoding = "ASCII"))
gps <- read_csv("data/mc2/gps.csv")
loyalty <- read_csv("data/mc2/loyalty_data.csv",locale = locale(encoding = "ASCII"))

2.5. Data examination

First, we will look at both the cc and loyalty card datasets using the glimpse function, as shown below. There are 1,490 rows in the cc dataset and 1,392 rows in the loyalty dataset. Looking closely, we can see that the two datasets are closely linked by location, price and timestamp; only last4ccnum and loyaltynum differ.

Looking back at the MC2 background, Kronos-based companies are allowed to collect credit card and loyalty card information on GAStech employees' purchases; as such, these two datasets are similar in nature.

However, the numbers of rows in the cc and loyalty card data differ. This anomaly might have a few explanations: 1) an employee did not use a credit card during a purchase but presented a loyalty card; or 2) vice versa, an employee used a credit card but did not present a loyalty card during the purchase.

glimpse(cc)
Rows: 1,490
Columns: 4
$ timestamp  <chr> "1/6/2014 7:28", "1/6/2014 7:34", "1/6/2014 7:35"~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(loyalty)
Rows: 1,392
Columns: 4
$ timestamp  <chr> "1/6/2014", "1/6/2014", "1/6/2014", "1/6/2014", "~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~

Next, we will look at the vehicle datasets. The 44-row car dataset lists each employee and their employment information together with their assigned car ID.

The gps dataset records the cars' movements as latitude and longitude positions with timestamps.

glimpse(car)
Rows: 44
Columns: 5
$ LastName               <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName              <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType  <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
glimpse(gps)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "1/6/2014 6:28", "1/6/2014 6:28", "1/6/2014 6:28",~
$ id        <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~

2.6. Data Preparation

2.6.1 CC and Loyalty dataset

We will prepare the cc and loyalty datasets for data exploration later.

2.6.1.1. Changing the datatypes of last4ccnum & loyaltynum

The last4ccnum column of the cc dataset and the loyaltynum column of the loyalty dataset should be categorical data types. As such, we will change them using the as.factor function.

cc$last4ccnum <- as.factor(cc$last4ccnum)
loyalty$loyaltynum <- as.factor(loyalty$loyaltynum)

Next, we will modify the datatype of the timestamp column in both the cc and loyalty datasets using the clock package. As shown below, the date_time_parse function is used to change the cc timestamp to the dttm (datetime) format, while the date_parse function is used to change the loyalty timestamp to the date format.

cc$timestamp <- date_time_parse(cc$timestamp,
                                 zone = "",
                                 format = "%m/%d/%Y %H:%M")

loyalty$timestamp <- date_parse(loyalty$timestamp,
                                 format = "%m/%d/%Y")

We will double-check the datasets to confirm that the datatypes have been changed to the ones we wanted.

glimpse(cc)
Rows: 1,490
Columns: 4
$ timestamp  <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(loyalty)
Rows: 1,392
Columns: 4
$ timestamp  <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <fct> L2247, L9406, L8328, L6417, L1107, L4034, L6110, ~

2.6.1.2 Splitting the timestamp data into individual date, time and day columns.

To allow us to dive deeper into our analysis later, we will split the timestamp into separate date, time and day columns. We will use the mutate function from dplyr to add day, date and time columns to the cc dataset, saving the result into the cc_dtsplit object.

There is no need to add columns to the loyalty dataset, as we will join the two datasets together, and day will be included in the joined dataset.

cc_dtsplit <- cc %>%
  mutate(day = date_weekday_factor(cc$timestamp),
         date = as_date(cc$timestamp),
         time = format(cc$timestamp, format = "%H:%M"))

cc_dtsplit
# A tibble: 1,490 x 7
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:28:00 Brew've Been~ 11.3  4795       Mon   2014-01-06
 2 2014-01-06 07:34:00 Hallowed Gro~ 52.2  7108       Mon   2014-01-06
 3 2014-01-06 07:35:00 Brew've Been~  8.33 6816       Mon   2014-01-06
 4 2014-01-06 07:36:00 Hallowed Gro~ 16.7  9617       Mon   2014-01-06
 5 2014-01-06 07:37:00 Brew've Been~  4.24 7384       Mon   2014-01-06
 6 2014-01-06 07:38:00 Brew've Been~  4.17 5368       Mon   2014-01-06
 7 2014-01-06 07:42:00 Coffee Camel~ 28.7  7253       Mon   2014-01-06
 8 2014-01-06 07:43:00 Brew've Been~  9.6  4948       Mon   2014-01-06
 9 2014-01-06 07:43:00 Brew've Been~ 16.9  9683       Mon   2014-01-06
10 2014-01-06 07:47:00 Hallowed Gro~ 16.5  8129       Mon   2014-01-06
# ... with 1,480 more rows, and 1 more variable: time <chr>

To allow us to join the two datasets together, we will rename the timestamp column of the loyalty dataset to date so that both date columns share the same name.

loyalty_dt <- rename(loyalty, date = timestamp)

We will take a look at our newly cleaned datasets to double-check the changes we made previously.

glimpse(cc_dtsplit)
Rows: 1,490
Columns: 7
$ timestamp  <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ day        <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~
$ date       <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ time       <chr> "07:28", "07:34", "07:35", "07:36", "07:37", "07:~
glimpse(loyalty_dt)
Rows: 1,392
Columns: 4
$ date       <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <fct> L2247, L9406, L8328, L6417, L1107, L4034, L6110, ~

2.6.1.3. Changing the datatypes of car & gps

First, we will take a look at the car & gps datasets. Notice that CarID and id in these datasets are not of the categorical datatype we want. We will proceed to change both columns.

glimpse(car)
Rows: 44
Columns: 5
$ LastName               <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName              <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType  <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
glimpse(gps)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "1/6/2014 6:28", "1/6/2014 6:28", "1/6/2014 6:28",~
$ id        <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~

Changing CarID and id to categorical data:

car$CarID = as.factor(car$CarID)
gps$id = as.factor(gps$id)

2.6.1.4. Combining both first and last name.

Next, we will combine the first and last names in the car dataset into a single column so that we can identify each person easily. We will use the tidyr function unite to merge LastName and FirstName into name.

car_unite <- car %>%
  unite(col = "name", LastName,FirstName, sep = ", ",  remove =FALSE) 

2.6.1.5. Change the Timestamp datatype and rename id to CarID

Next, we will rename the id column of gps to CarID to match the car_unite data. Additionally, the Timestamp column of gps will be changed to the dttm format.

gps_cleaned <- rename(gps,CarID = id)

gps_cleaned$Timestamp <- date_time_parse(gps_cleaned$Timestamp,
                                 zone = "",
                                 format = "%m/%d/%Y %H:%M")

Lastly, we will look at our cleaned datasets; with this, the data cleaning is complete.

glimpse(car_unite)
Rows: 44
Columns: 6
$ name                   <chr> "Calixto, Nils", "Azada, Lars", "Bala~
$ LastName               <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName              <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID                  <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType  <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
glimpse(gps_cleaned)
Rows: 685,169
Columns: 4
$ Timestamp <dttm> 2014-01-06 06:28:00, 2014-01-06 06:28:00, 2014-01~
$ CarID     <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~

3. Methodology and Visualisation

Below, we discuss the methodology and visualisation methods used to answer the questions later.

3.1. Bar Chart

First, we will use a bar chart to analyse the number of patrons at each location. A bar chart is very useful for analysing categorical data by counting instances.

The code uses both the tidyverse and plotly packages to create the bar chart. First, count is used to tally visits to each location. Then a mutate with fct_reorder sorts the locations by count in descending order. Next, a plotly bar chart is created by specifying the x and y axes, with additional layout options for the chart.

The plotly bar chart is accessible later in the question segment.

3.2. Line Chart

To visualise time-series data, a line chart is a great analysis tool.

The line chart below plots the top few most visited places as a time series.

3.3. Geospatial Visualisation

Next, a geospatial map visualisation is created to analyse the movement of each vehicle.
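As a sketch of this step, the shapefile can be read with sf and overlaid with one car's GPS trail using tmap (the dsn/layer values below are assumptions based on the Geospatial folder described earlier, not the exact code used):

```r
# Sketch: overlay one car's GPS trail on the Abila road network.
library(sf)
library(tmap)
library(dplyr)

abila_roads <- st_read(dsn = "data/mc2/Geospatial", layer = "Abila")

car1_path <- gps_cleaned %>%
  filter(CarID == "1") %>%
  arrange(Timestamp) %>%
  st_as_sf(coords = c("long", "lat"), crs = 4326) %>%  # WGS84 lat/long points
  summarise(do_union = FALSE) %>%                      # keep point order
  st_cast("LINESTRING")                                # points -> one track line

tmap_mode("view")                                      # interactive map
tm_shape(abila_roads) +
  tm_lines(col = "grey70") +
  tm_shape(car1_path) +
  tm_lines(col = "red", lwd = 2)
```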

3.4. Network Analysis

An interactive network analysis is created to visualise the relationships between people.

Notice that a CarID can be selected to see its relationships with other CarIDs. The nearest and highlighted nodes are CarIDs that have a close relationship with the person of interest.
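A minimal sketch of such a network, assuming we link credit cards that transacted at the same location on the same date (this pairing rule is an illustrative assumption, not necessarily the one used for the final chart):

```r
library(dplyr)
library(visNetwork)

# Pair up cards seen at the same location on the same date
pairs <- cc_loyalty_join %>%
  transmute(date, location, card = as.character(last4ccnum)) %>%
  inner_join(., ., by = c("date", "location")) %>%
  filter(card.x < card.y) %>%                 # drop self-pairs and duplicates
  count(card.x, card.y, name = "weight")

nodes <- tibble(id = sort(unique(c(pairs$card.x, pairs$card.y))),
                label = id)
edges <- pairs %>%
  rename(from = card.x, to = card.y) %>%
  mutate(value = weight)                      # edge width ~ co-occurrence count

visNetwork(nodes, edges) %>%
  visOptions(highlightNearest = TRUE,         # highlight closely related nodes
             nodesIdSelection = TRUE)         # dropdown to select a node
```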

4. Answers

As shown previously in our data preparation segment, there is a difference of 98 rows between the cc and loyalty card data. This points to an anomaly: during some purchases only one of the two cards was used, resulting in differing numbers of entries.

glimpse(cc_dtsplit)
Rows: 1,490
Columns: 7
$ timestamp  <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ day        <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~
$ date       <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ time       <chr> "07:28", "07:34", "07:35", "07:36", "07:37", "07:~
glimpse(loyalty_dt)
Rows: 1,392
Columns: 4
$ date       <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <fct> L2247, L9406, L8328, L6417, L1107, L4034, L6110, ~

As explained previously, the cc and loyalty datasets are similar in nature except for last4ccnum and loyaltynum; therefore, a join is required to match the rows where a user used both the credit card and the loyalty card. However, before performing the join, we will analyse both the cc and loyalty datasets by location counts.

As a first exploratory data analysis, a bar chart of the cc_dtsplit dataset is used to find out which locations were patronised the most.

Based on the bar chart below, we can observe that 4 locations have significantly more visitors than the rest. The highest is Katerina's Cafe with a total of 212 visits, followed by Hippokampos, Guy's Gyros and Brew've Been Served. Excluding Hippokampos, which cannot be located on the map provided, these places are all restaurants/bars.

cc_dtsplit_bar <- cc_dtsplit %>%
  count(location) %>%
  mutate(location = fct_reorder(location, n, .desc =TRUE)) %>%
  plot_ly(x = ~location, y = ~n, marker = list(color = ~n)) %>%
  add_bars() %>%
  layout(title = "Total number of visitation by location", xaxis = list(title = ""),yaxis = list(title = "Number of visitors"))

cc_dtsplit_bar

Next, we will take a look at the loyalty_dt dataset.

Based on the bar chart below, we can observe that the 4 most visited locations are the same four shown for the cc_dtsplit dataset.

One difference is that Ouzeri Elian jumps to fifth spot, whereas in the cc_dtsplit dataset Ouzeri Elian is in sixth spot.

loyalty_dt_bar <- loyalty_dt %>%
  count(location) %>%
  mutate(location = fct_reorder(location, n, .desc =TRUE)) %>%
  plot_ly(x = ~location, y = ~n, marker = list(color = ~n)) %>%
  add_bars() %>%
  layout(title = "Total number of visitation by location", xaxis = list(title = ""),yaxis = list(title = "Number of visitors"))

loyalty_dt_bar

Next, an inner join will be conducted on date, location and price, and the result will be output into a new object, cc_loyalty_join.

Notice that there are only 1,087 rows, whereas cc_dtsplit and loyalty_dt have 1,490 and 1,392 rows respectively. This shows that some of the cc data does not match the loyalty data, which strengthens the initial anomaly that some employees used only one of the two cards during purchases.

cc_loyalty_join <- cc_dtsplit %>%
  inner_join(loyalty_dt, by = c("date","location", "price"))

cc_loyalty_join
# A tibble: 1,087 x 8
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:28:00 Brew've Been~ 11.3  4795       Mon   2014-01-06
 2 2014-01-06 07:35:00 Brew've Been~  8.33 6816       Mon   2014-01-06
 3 2014-01-06 07:36:00 Hallowed Gro~ 16.7  9617       Mon   2014-01-06
 4 2014-01-06 07:37:00 Brew've Been~  4.24 7384       Mon   2014-01-06
 5 2014-01-06 07:38:00 Brew've Been~  4.17 5368       Mon   2014-01-06
 6 2014-01-06 07:43:00 Brew've Been~  9.6  4948       Mon   2014-01-06
 7 2014-01-06 07:43:00 Brew've Been~ 16.9  9683       Mon   2014-01-06
 8 2014-01-06 07:47:00 Hallowed Gro~ 16.5  8129       Mon   2014-01-06
 9 2014-01-06 07:48:00 Hallowed Gro~ 10.7  3492       Mon   2014-01-06
10 2014-01-06 07:49:00 Coffee Camel~  8.39 5921       Mon   2014-01-06
# ... with 1,077 more rows, and 2 more variables: time <chr>,
#   loyaltynum <fct>

After joining the datasets, we find that there are 1,087 rows of data. This means that some credit card purchases are not linked to any loyalty card and, vice versa, some loyalty card transactions are not linked to any credit card purchase. This strengthens the anomaly we highlighted before. We will take a look at the unmatched records later.

Before further analysis of the anomalies between the two datasets, we shall conduct some exploratory data analysis. First, we will identify the most popular places visited during the period of study.

To do this, we will group the data by location and count the number of visits. Looking at the raw output, we can see that the 4 most visited places are the same four mentioned previously.

cc_location_count <- cc_loyalty_join %>%
  group_by(location) %>%
  summarize(count = n()) %>%
  arrange(desc(count))

cc_location_count
# A tibble: 32 x 2
   location               count
   <chr>                  <int>
 1 "Katerina\u0012s Cafi"   155
 2 "Guy's Gyros"            121
 3 "Hippokampos"            117
 4 "Brew've Been Served"    111
 5 "Ouzeri Elian"            62
 6 "Hallowed Grounds"        60
 7 "Kalami Kafenion"         53
 8 "Abila Zacharo"           52
 9 "Gelatogalore"            48
10 "Bean There Done That"    36
# ... with 22 more rows

Next, a bar chart is produced sorted descendingly by the number of visitations.

cc_join <- cc_loyalty_join %>%
  count(location) %>%
  mutate(location = fct_reorder(location, n, .desc =TRUE)) %>%
  plot_ly(x = ~location, y = ~n, marker = list(color = ~n)) %>%
  add_bars(name = "inner-join") %>%
  layout(title = "Total number of visitation by area", xaxis = list(title = ""),yaxis = list(title = "Number of visitors"))

cc_join

From the chart above, we can observe that "Katerina's Cafe" (155 counts), "Guy's Gyros" (121 counts), "Hippokampos" (117 counts) and "Brew've Been Served" (111 counts) had the most visits during the period of study. Looking closely, the 4th most visited place has around 2x the visits of the 5th, showing that the top 4 places account for most of the patronage.

Although we have shown the overall most visited places, the more interesting analysis lies in when the places are visited. As such, we will group the joined data by location, date, day and time to see at which points in time each location was visited the most.

cc_loyalty_group <- cc_loyalty_join %>%
  group_by(location,date,day,time) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  ungroup()

cc_loyalty_group
# A tibble: 992 x 5
   location             date       day   time  count
   <chr>                <date>     <ord> <chr> <int>
 1 Bean There Done That 2014-01-13 Mon   12:00     6
 2 Bean There Done That 2014-01-06 Mon   12:00     5
 3 Bean There Done That 2014-01-10 Fri   12:00     5
 4 Bean There Done That 2014-01-17 Fri   12:00     5
 5 Bean There Done That 2014-01-16 Thu   12:00     4
 6 Guy's Gyros          2014-01-19 Sun   19:45     4
 7 Jack's Magical Beans 2014-01-13 Mon   12:00     4
 8 Jack's Magical Beans 2014-01-16 Thu   12:00     4
 9 Bean There Done That 2014-01-14 Tue   12:00     3
10 Bean There Done That 2014-01-15 Wed   12:00     3
# ... with 982 more rows

From the table above, we can observe that "Bean There Done That" was patronised the most at a single point in time, followed by "Guy's Gyros" and "Jack's Magical Beans". "Bean There Done That" also accounts for the most of these high-visitation time slots.

Next, we will plot a time-series line chart of the locations with counts of 3 or more to see whether there is a trend. The reason for plotting counts of 3 or more is to focus on the locations that were highly visited at a single point in time.

We will first filter the dataset.

cc_loyalty_top <- cc_loyalty_group %>%
  filter(count >= 3) %>%
  ungroup()

Next, we will create the line chart.

Based on the most frequented places, we can see that January 13 clocked the highest count for one day. Additionally, Bean There Done That clocked the highest visits per day, followed by Jack's Magical Beans.

cc_timeseries <- cc_loyalty_top %>%
  plot_ly(x = ~date, y =~count, color = ~location, hoverinfo = "text",
          text = ~paste("Location:", location, "<br>","Date:", date, "<br>","Time:", time, "<br>", "Day:", day)) %>%
  add_lines() %>%
  add_markers(showlegend = FALSE) %>%
  layout(title = "Most frequented places by groups of people", xaxis = list(title = "", dtick = ~date, showgrid = FALSE), yaxis = list(title = "Number of people"))

cc_timeseries

Next, we will analyse the most frequented locations by day and time. To do this, we will use unite to combine day and time into one column, then group by location and daytime.

cc_loyalty_day <- cc_loyalty_join %>%
  unite(col = "daytime", day,time, sep = " ",  remove =FALSE) %>%
  group_by(location, daytime) %>%
  summarize(count = n()) %>%
  arrange(desc(count))

cc_loyalty_day
# A tibble: 937 x 3
# Groups:   location [32]
   location             daytime   count
   <chr>                <chr>     <int>
 1 Bean There Done That Mon 12:00    11
 2 Bean There Done That Fri 12:00    10
 3 Jack's Magical Beans Mon 12:00     7
 4 Bean There Done That Thu 12:00     6
 5 Jack's Magical Beans Thu 12:00     6
 6 Bean There Done That Wed 12:00     5
 7 Brewed Awakenings    Fri 12:00     5
 8 Brewed Awakenings    Mon 12:00     5
 9 Jack's Magical Beans Fri 12:00     5
10 Jack's Magical Beans Wed 12:00     5
# ... with 927 more rows

Next, a bar chart is plotted after filtering for counts greater than or equal to 5. From the bar chart, we can see that Bean There Done That has the highest visits at a point in time, similar to our earlier analysis: the highest is 11 counts on Mon 12:00H, followed by Fri 12:00H with 10 counts. The next most visited location at a point in time is Jack's Magical Beans, which clocked 7 counts on Mon 12:00H, followed by 6 counts on Thu 12:00H.

cc_day <- cc_loyalty_day %>%
  filter(count >=5) %>%
  plot_ly(x = ~daytime, y = ~count, color = ~location, hoverinfo = "text", text = ~paste("Location:", location, "<br>","Day Time:", daytime, "<br>", "No.of visitors:", count)) %>%
  add_bars() %>%
  layout (title = "Total counts of visitors by day and time", xaxis = list(title = "", size = 3, dtick = ~daytime), yaxis = list(title = "Number of visitors") )

cc_day

Having completed some EDA on the inner-joined cc and loyalty datasets, we will now sieve out the records that are in cc but not in loyalty. We will do this with an anti-join: anti_join keeps the rows of the main dataset that have no match in the secondary dataset.

This is similar to performing a full join and thereafter sieving out the rows with NA values.
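This equivalence can be illustrated on toy data (the tibbles a and b below are made up for illustration):

```r
library(dplyr)

a <- tibble(date = c("d1", "d1", "d2"), price = c(1, 2, 3))
b <- tibble(date = "d1", price = 2, loyaltynum = "L0001")

anti_join(a, b, by = c("date", "price"))
# rows of `a` with no match in `b`: (d1, 1) and (d2, 3)

a %>%
  full_join(b, by = c("date", "price")) %>%
  filter(is.na(loyaltynum)) %>%     # sieve out the matched rows
  select(-loyaltynum)
# the same two rows, via a full join followed by filtering NA matches
```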

From the anti-joined data below, we can see that there are 409 rows that are in cc but not in loyalty. This reflects the anomaly that the credit card was used for these purchases but the loyalty card was not.

cc_loyalty_antijoin <- cc_dtsplit %>%
  anti_join(loyalty_dt, by = c("date","location", "price"))


cc_loyalty_antijoin
# A tibble: 409 x 7
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:34:00 Hallowed Gr~ 5.22e1 7108       Mon   2014-01-06
 2 2014-01-06 07:42:00 Coffee Came~ 2.87e1 7253       Mon   2014-01-06
 3 2014-01-06 07:52:00 Brew've Bee~ 3.28e1 9405       Mon   2014-01-06
 4 2014-01-06 07:54:00 Hallowed Gr~ 1.69e1 7889       Mon   2014-01-06
 5 2014-01-06 07:54:00 Hallowed Gr~ 3.29e1 2681       Mon   2014-01-06
 6 2014-01-06 08:00:00 Hallowed Gr~ 9.69e0 6691       Mon   2014-01-06
 7 2014-01-06 08:14:00 Coffee Came~ 3.2 e0 2418       Mon   2014-01-06
 8 2014-01-06 12:00:00 Coffee Shack 5.15e1 7117       Mon   2014-01-06
 9 2014-01-06 12:00:00 Jack's Magi~ 2.78e1 8156       Mon   2014-01-06
10 2014-01-06 12:16:00 Abila Airpo~ 1.87e3 8642       Mon   2014-01-06
# ... with 399 more rows, and 1 more variable: time <chr>

Next, we will create a basic bar chart based on the anti-joined dataset.

Based on the bar chart below, we can see that, similar to the inner-joined dataset, the four most visited locations are Katerina’s Cafe, Hippokampos, Brew’ve Been Served and Guy’s Gyros, with the highest count of 59 clocked at Katerina’s Cafe.

One major difference between this anti-joined dataset and the inner-joined one is that the drop-off across locations is not as steep. In the inner-joined dataset, the 4th most visited location had almost twice the count of the 5th; in the chart below, the decrease is much gentler.

cc_anti <- cc_loyalty_antijoin %>%
  count(location) %>%
  mutate(location = fct_reorder(location, n, .desc =TRUE)) %>%
  plot_ly(x = ~location, y = ~n, marker = list(color = ~n)) %>%
  add_bars(name = "anti-join") %>%
  layout(title = "Total number of visitation by area", xaxis = list(title = "", automargin = TRUE),yaxis = list(title = "Number of visitors", automargin = TRUE))

cc_anti

Next, we will group the anti-joined dataset to extract the number of visitations at each point in time. Based on the raw grouped data below, we can see that the visitation counts are low and fairly uniform, with no large spikes observed.

cc_loyalty_anti_group <- cc_loyalty_antijoin %>%
  group_by(location, day, date, time) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  ungroup()

cc_loyalty_anti_group
# A tibble: 391 x 5
   location             day   date       time  count
   <chr>                <ord> <date>     <chr> <int>
 1 Bean There Done That Tue   2014-01-07 12:00     3
 2 Brewed Awakenings    Tue   2014-01-07 12:00     3
 3 Jack's Magical Beans Thu   2014-01-09 12:00     3
 4 Bean There Done That Tue   2014-01-14 12:00     2
 5 Bean There Done That Wed   2014-01-08 12:00     2
 6 Brew've Been Served  Fri   2014-01-10 07:56     2
 7 Guy's Gyros          Tue   2014-01-07 20:10     2
 8 Hallowed Grounds     Mon   2014-01-06 07:54     2
 9 Hallowed Grounds     Wed   2014-01-08 07:39     2
10 Hallowed Grounds     Thu   2014-01-16 07:51     2
# ... with 381 more rows

Using the grouped data, we will create a time-series chart to see the trend throughout the duration of the study. We will plot only the records with a count greater than or equal to 2.

Based on the chart below, we can observe that Bean There Done That and Jack’s Magical Beans each reach the highest visitation count of 3 transactions at one point. The rest of the locations stay fairly constant at 2 transactions.

cc_anti_timeseries <- cc_loyalty_anti_group %>%
  filter(count >= 2) %>%
  plot_ly(x = ~date, y =~count, color = ~location, hoverinfo = "text",
          text = ~paste("Location:", location, "<br>","Date:", date, "<br>","Time:", time, "<br>", "Day:", day)) %>%
  add_lines() %>%
  add_markers(showlegend = FALSE) %>%
  layout(title = "Most frequented places by groups of people", xaxis = list(title = "", dtick = ~date, showgrid = FALSE), yaxis = list(title = "Number of people"))

cc_anti_timeseries

Similar to the inner-joined dataset above, we will create a daytime column to observe which day-time combination has the highest count.

cc_loyalty_day_anti <- cc_loyalty_antijoin %>%
  unite(col = "daytime", day,time, sep = " ",  remove =FALSE) %>%
  group_by(location, daytime) %>%
  summarize(count = n()) %>%
  arrange(desc(count))

cc_loyalty_day_anti
# A tibble: 379 x 3
# Groups:   location [31]
   location             daytime   count
   <chr>                <chr>     <int>
 1 Bean There Done That Tue 12:00     5
 2 Jack's Magical Beans Tue 12:00     4
 3 Bean There Done That Wed 12:00     3
 4 Brewed Awakenings    Tue 12:00     3
 5 Hallowed Grounds     Thu 07:51     3
 6 Hallowed Grounds     Wed 07:39     3
 7 Jack's Magical Beans Thu 12:00     3
 8 Jack's Magical Beans Wed 12:00     3
 9 Bean There Done That Thu 12:00     2
10 Brew've Been Served  Fri 07:56     2
# ... with 369 more rows

Based on the chart below, we can see that the highest count per daytime is 5 visitors at Bean There Done That, clocked on Tue 12:00H, followed by Jack’s Magical Beans with 4 counts at the same day and time.

cc_anti_day <- cc_loyalty_day_anti %>%
  filter(count >=3) %>%
  plot_ly(x = ~daytime, y = ~count, color = ~location, hoverinfo = "text", text = ~paste("Location:", location, "<br>","Day Time:", daytime, "<br>", "No.of visitors:", count)) %>%
  add_bars() %>%
  layout (title = "Total counts of visitors by day and time", xaxis = list(title = "", size = 3, dtick = ~daytime), automargin = TRUE, yaxis = list(title = "Number of visitors"))

cc_anti_day

Next, we will anti join from loyalty to cc.

Based on the anti-joined data, we observe that there are 311 rows of transactions in which a loyalty card was used with no matching credit card record.

cc_loyalty_antijoin_right <- loyalty_dt %>%
  anti_join(cc_dtsplit, by = c("date","location", "price")) %>%
  ungroup()

cc_loyalty_antijoin_right
# A tibble: 311 x 4
   date       location                    price loyaltynum
   <date>     <chr>                       <dbl> <fct>     
 1 2014-01-06 "Coffee Shack"              11.5  L6417     
 2 2014-01-06 "Hallowed Grounds"          12.9  L1107     
 3 2014-01-06 "Abila Zacharo"             26.9  L9018     
 4 2014-01-06 "Katerina\u0012s Cafi"      38.6  L6110     
 5 2014-01-06 "Frydos Autosupply n' More" 89.3  L4034     
 6 2014-01-07 "Hallowed Grounds"           8.42 L1107     
 7 2014-01-07 "Coffee Cameleon"           12.4  L9018     
 8 2014-01-07 "Ouzeri Elian"              17.7  L2247     
 9 2014-01-07 "Katerina\u0012s Cafi"      25.0  L2343     
10 2014-01-08 "Brew've Been Served"       14.9  L2247     
# ... with 301 more rows

Notice that there is no day column after the anti-join was performed. This is because the original loyalty_dt dataset does not include the day. As such, day is created using the date_weekday_factor function.

cc_loyalty_antijoin_mutate <- cc_loyalty_antijoin_right %>%
  mutate(day = date_weekday_factor(cc_loyalty_antijoin_right$date))
cc_loyalty_antijoin_mutate
# A tibble: 311 x 5
   date       location                    price loyaltynum day  
   <date>     <chr>                       <dbl> <fct>      <ord>
 1 2014-01-06 "Coffee Shack"              11.5  L6417      Mon  
 2 2014-01-06 "Hallowed Grounds"          12.9  L1107      Mon  
 3 2014-01-06 "Abila Zacharo"             26.9  L9018      Mon  
 4 2014-01-06 "Katerina\u0012s Cafi"      38.6  L6110      Mon  
 5 2014-01-06 "Frydos Autosupply n' More" 89.3  L4034      Mon  
 6 2014-01-07 "Hallowed Grounds"           8.42 L1107      Tue  
 7 2014-01-07 "Coffee Cameleon"           12.4  L9018      Tue  
 8 2014-01-07 "Ouzeri Elian"              17.7  L2247      Tue  
 9 2014-01-07 "Katerina\u0012s Cafi"      25.0  L2343      Tue  
10 2014-01-08 "Brew've Been Served"       14.9  L2247      Wed  
# ... with 301 more rows

Similar to the previous inner join and left anti-join data, the four most visited locations are Katerina’s Cafe, Hippokampos, Brew’ve Been Served and Guy’s Gyros.

Katerina’s Cafe and Hippokampos lead as the two most visited locations, with 42 and 40 visits respectively.

cc_anti_right <- cc_loyalty_antijoin_mutate %>%
  count(location) %>%
  mutate(location = fct_reorder(location, n, .desc =TRUE)) %>%
  plot_ly(x = ~location, y = ~n, marker = list(color = ~n)) %>%
  add_bars(name = "anti-join-right") %>%
  layout(title = "Total number of visitation by area", xaxis = list(title = "", automargin = TRUE),yaxis = list(title = "Number of visitors", automargin = TRUE))

cc_anti_right

Next, we will group the right anti-joined dataset to get the count per day by location.

From the chart below, we can see that Hallowed Grounds has the highest count of 6 visitations, on 2014-01-09.

cc_loyalty_day_anti_right <- cc_loyalty_antijoin_mutate %>%
  group_by(location, date,day) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  ungroup()

cc_loyalty_day_anti_right
# A tibble: 161 x 4
   location               date       day   count
   <chr>                  <date>     <ord> <int>
 1 "Hallowed Grounds"     2014-01-09 Thu       6
 2 "Brew've Been Served"  2014-01-08 Wed       5
 3 "Brew've Been Served"  2014-01-09 Thu       5
 4 "Brew've Been Served"  2014-01-10 Fri       5
 5 "Guy's Gyros"          2014-01-15 Wed       5
 6 "Hippokampos"          2014-01-10 Fri       5
 7 "Katerina\u0012s Cafi" 2014-01-13 Mon       5
 8 "Katerina\u0012s Cafi" 2014-01-14 Tue       5
 9 "Katerina\u0012s Cafi" 2014-01-16 Thu       5
10 "Abila Zacharo"        2014-01-13 Mon       4
# ... with 151 more rows

Next, we will create a bar chart to analyse the visitations with counts greater than or equal to 4. Based on the chart below, although Hallowed Grounds has the single highest visitation count, it reaches that peak only once. This can be considered another anomaly, as the count of visitations is not consistent across the dataset provided.

The next highest count of 5 visitations was clocked by Guy’s Gyros on 2014-01-15, Hippokampos on 2014-01-10, and by Katerina’s Cafe and Brew’ve Been Served on three occasions each.

cc_anti_day_right <- cc_loyalty_day_anti_right %>%
  unite(col = "daydate", day, date, sep = " ", remove = FALSE) %>%
  filter(count >=4) %>%
  plot_ly(x = ~daydate, y = ~count, color = ~location, hoverinfo = "text", text = ~paste("Location:", location, "<br>","DayDate:", daydate, "<br>", "No.of visitors:", count)) %>%
  add_bars() %>%
  layout (title = "Total counts of visitors by date", xaxis = list(title = "", size = 3, dtick = ~daydate), automargin = TRUE, yaxis = list(title = "Number of visitors"))

cc_anti_day_right

Next, we will create a time-series chart to see the trends of visitations to the respective locations.

One more peculiar occurrence can be discovered from the chart below: Brew’ve Been Served records 5 transactions on 3 consecutive days, after which no visits with 4 or more transactions appear for it in the right anti-join.

cc_anti_timeseries_right <- cc_loyalty_day_anti_right %>%
  filter(count >= 4) %>%
  plot_ly(x = ~date, y = ~count, color = ~location, hoverinfo = "text",
          text = ~paste("Location:", location, "<br>", "Date:", date, "<br>", "Day:", day)) %>%
  add_lines() %>%
  add_markers(showlegend = FALSE) %>%
  layout(title = "Most frequented places by groups of people", xaxis = list(title = "", dtick = ~date, showgrid = FALSE), yaxis = list(title = "Number of people"))

cc_anti_timeseries_right

Next, we will create a facet grid of the three joined datasets to analyse everything in one go. To create the facet grid, we will use the subplot function from the plotly package, placing the three charts side by side for analysis.

subplot(cc_join, cc_anti, cc_anti_right, nrows = 1, shareX = TRUE, shareY = TRUE)

Lastly, we will inner join the cc_loyalty_join dataset with itself by “last4ccnum”, “location”, “price”, “time” and “date”, and filter for rows with differing loyaltynum. The cc_loyalty_join dataset is the initial inner-joined dataset.

Based on the tibble below, we can see 12 rows in which 6 credit card numbers are each paired with more than one loyaltynum. This is another anomaly that has been found.

cc_loyalty_join %>% 
  inner_join(cc_loyalty_join, by = c("last4ccnum", "location", "price","time", "date")) %>%
  filter(loyaltynum.x != loyaltynum.y)
# A tibble: 12 x 11
   timestamp.x         location      price last4ccnum day.x date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-09 19:30:00 "Katerina\u0~ 26.6  5921       Thu   2014-01-09
 2 2014-01-09 19:30:00 "Katerina\u0~ 26.6  5921       Thu   2014-01-09
 3 2014-01-09 20:06:00 "Katerina\u0~ 26.6  4948       Thu   2014-01-09
 4 2014-01-09 20:06:00 "Katerina\u0~ 26.6  4948       Thu   2014-01-09
 5 2014-01-09 20:23:00 "Guy's Gyros"  8.23 7889       Thu   2014-01-09
 6 2014-01-09 20:23:00 "Guy's Gyros"  8.23 7889       Thu   2014-01-09
 7 2014-01-09 20:38:00 "Guy's Gyros"  8.23 5368       Thu   2014-01-09
 8 2014-01-09 20:38:00 "Guy's Gyros"  8.23 5368       Thu   2014-01-09
 9 2014-01-11 19:29:00 "Hippokampos" 63.2  4795       Sat   2014-01-11
10 2014-01-11 19:29:00 "Hippokampos" 63.2  4795       Sat   2014-01-11
11 2014-01-11 19:45:00 "Hippokampos" 63.2  8332       Sat   2014-01-11
12 2014-01-11 19:45:00 "Hippokampos" 63.2  8332       Sat   2014-01-11
# ... with 5 more variables: time <chr>, loyaltynum.x <fct>,
#   timestamp.y <dttm>, day.y <ord>, loyaltynum.y <fct>

4.2. Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.

We will examine the two vehicle-related datasets: the GPS data and the car-assignments data.

The car-assignments data, with a total of 44 rows, consists of the name and appointment of each employee tagged to a CarID.

The GPS data, with a total of 685,169 rows, consists of the CarID and its movements recorded as latitude and longitude with a timestamp.

glimpse(car_unite)
Rows: 44
Columns: 6
$ name                   <chr> "Calixto, Nils", "Azada, Lars", "Bala~
$ LastName               <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName              <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID                  <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType  <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
glimpse(gps_cleaned)
Rows: 685,169
Columns: 4
$ Timestamp <dttm> 2014-01-06 06:28:00, 2014-01-06 06:28:00, 2014-01~
$ CarID     <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~

In order to match each car’s assignee to the GPS data, we will inner join the two datasets.

Notice that the joined data has a total of only 613,077 rows. This shows that some of the recorded vehicles are not part of the car-assignment data, i.e. vehicles apart from those of the car assignees have been tracked too. This phenomenon will be explained in the next section.
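
Before plotting anything, a quick way to confirm this is to compare the CarID values directly (a small sketch; it assumes CarID is a factor in both datasets, as the glimpse output above shows):

```r
# CarIDs present in the GPS data but absent from the car assignments
setdiff(unique(as.character(gps_cleaned$CarID)),
        unique(as.character(car_unite$CarID)))
```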

gps_car <- gps_cleaned %>%
  inner_join(car_unite, by = "CarID") 


gps_car
# A tibble: 613,077 x 9
   Timestamp           CarID   lat  long name       LastName FirstName
   <dttm>              <fct> <dbl> <dbl> <chr>      <chr>    <chr>    
 1 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 2 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 3 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 4 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 5 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 6 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 7 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 8 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 9 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
10 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
# ... with 613,067 more rows, and 2 more variables:
#   CurrentEmploymentType <chr>, CurrentEmploymentTitle <chr>

The car_unite dataset consists of the cars assigned to GASTech employees, as well as truck drivers who have no CarID.

DT::datatable(car_unite)

Comparing the above car_unite DT table with the gps_cleaned table below, you will observe vehicle IDs of 101-107. Those are assumed to be trucks.

DT::datatable(gps_cleaned)

If we look at the data above, we can see that within one minute the GPS readings vary a lot. This tells us that within a single minute, a car moves around quite a fair bit.
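
To put a number on this, one could sum the haversine distances between consecutive GPS pings within each minute (a sketch assuming the geosphere package, which is not otherwise used in this post):

```r
library(dplyr)
library(geosphere)

# approximate metres travelled by each car within each recorded minute
gps_cleaned %>%
  group_by(CarID, Timestamp) %>%
  summarize(metres = if (n() < 2) 0 else
              sum(distHaversine(
                cbind(head(long, -1), head(lat, -1)),
                cbind(tail(long, -1), tail(lat, -1)))),
            .groups = "drop") %>%
  arrange(desc(metres))
```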

Next, we will anti join both the car and gps datasets to sieve out those vehicle movement that are not part of the car assignees data.

The dataset below clearly shows the CarIDs of 101-107, which are assumed to be trucks.

Going back to the background of this assignment, we observe that GAStech does provide trucks for official business use. As such, we will assume that the vehicles with CarIDs 101 to 107 are trucks.

These trucks will later be examined to see if they were used for personal purposes.

gps_car_anti <- gps_cleaned %>%
  anti_join(car_unite, by = "CarID")

gps_car_anti
# A tibble: 72,092 x 4
   Timestamp           CarID   lat  long
   <dttm>              <fct> <dbl> <dbl>
 1 2014-01-06 07:36:00 101    36.0  24.9
 2 2014-01-06 07:36:00 101    36.0  24.9
 3 2014-01-06 07:36:00 101    36.0  24.9
 4 2014-01-06 07:36:00 101    36.0  24.9
 5 2014-01-06 07:36:00 101    36.0  24.9
 6 2014-01-06 07:36:00 101    36.0  24.9
 7 2014-01-06 07:36:00 101    36.0  24.9
 8 2014-01-06 07:36:00 101    36.0  24.9
 9 2014-01-06 07:36:00 101    36.0  24.9
10 2014-01-06 07:36:00 101    36.0  24.9
# ... with 72,082 more rows

We will revisit the cc_loyalty_join dataset.

cc_loyalty_join %>%
  arrange(timestamp)
# A tibble: 1,087 x 8
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:28:00 Brew've Been~ 11.3  4795       Mon   2014-01-06
 2 2014-01-06 07:35:00 Brew've Been~  8.33 6816       Mon   2014-01-06
 3 2014-01-06 07:36:00 Hallowed Gro~ 16.7  9617       Mon   2014-01-06
 4 2014-01-06 07:37:00 Brew've Been~  4.24 7384       Mon   2014-01-06
 5 2014-01-06 07:38:00 Brew've Been~  4.17 5368       Mon   2014-01-06
 6 2014-01-06 07:43:00 Brew've Been~  9.6  4948       Mon   2014-01-06
 7 2014-01-06 07:43:00 Brew've Been~ 16.9  9683       Mon   2014-01-06
 8 2014-01-06 07:47:00 Hallowed Gro~ 16.5  8129       Mon   2014-01-06
 9 2014-01-06 07:48:00 Hallowed Gro~ 10.7  3492       Mon   2014-01-06
10 2014-01-06 07:49:00 Coffee Camel~  8.39 5921       Mon   2014-01-06
# ... with 1,077 more rows, and 2 more variables: time <chr>,
#   loyaltynum <fct>

If we take a look at both joined datasets, gps_car and cc_loyalty_join, we will observe that the gps_car data starts at 2014-01-06 06:28 while cc_loyalty_join starts at 2014-01-06 07:28, a difference of one hour. We will examine this one-hour difference in a later part of our study.

tail(gps_car,6)
# A tibble: 6 x 9
  Timestamp           CarID   lat  long name        LastName FirstName
  <dttm>              <fct> <dbl> <dbl> <chr>       <chr>    <chr>    
1 2014-01-19 20:56:00 30     36.1  24.9 Resumir, F~ Resumir  Felix    
2 2014-01-19 20:56:00 30     36.1  24.9 Resumir, F~ Resumir  Felix    
3 2014-01-19 20:56:00 30     36.1  24.9 Resumir, F~ Resumir  Felix    
4 2014-01-19 20:56:00 30     36.1  24.9 Resumir, F~ Resumir  Felix    
5 2014-01-19 20:56:00 30     36.1  24.9 Resumir, F~ Resumir  Felix    
6 2014-01-19 20:56:00 30     36.1  24.9 Resumir, F~ Resumir  Felix    
# ... with 2 more variables: CurrentEmploymentType <chr>,
#   CurrentEmploymentTitle <chr>

Next, we will take a look at the last few rows of both joined datasets. We can see that the GPS data ends at 2014-01-19 20:56:00 while the cc_loyalty data stops at 2014-01-19 20:51:00, a difference of 5 minutes.

tail(cc_loyalty_join, 6)
# A tibble: 6 x 8
  timestamp           location price last4ccnum day   date       time 
  <dttm>              <chr>    <dbl> <fct>      <ord> <date>     <chr>
1 2014-01-19 20:16:00 "Hippok~  79.0 1415       Sun   2014-01-19 20:16
2 2014-01-19 20:16:00 "Guy's ~  37.5 9683       Sun   2014-01-19 20:16
3 2014-01-19 20:22:00 "Kateri~  38.9 9617       Sun   2014-01-19 20:22
4 2014-01-19 20:25:00 "Guy's ~  20.9 3853       Sun   2014-01-19 20:25
5 2014-01-19 20:30:00 "Guy's ~  31.9 4434       Sun   2014-01-19 20:30
6 2014-01-19 20:51:00 "Guy's ~  39.6 6901       Sun   2014-01-19 20:51
# ... with 1 more variable: loyaltynum <fct>
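
The same start/end comparison can be read off directly with range() and difftime(), instead of eyeballing the head and tail rows (a small convenience check):

```r
range(gps_car$Timestamp)           # GPS data coverage
range(cc_loyalty_join$timestamp)   # transaction data coverage

# gap between the two end timestamps
difftime(max(gps_car$Timestamp), max(cc_loyalty_join$timestamp))
```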

Now, we will filter for the records before 2014-01-06 07:28:00 to take a look at the CarIDs involved. We will first filter for the records before 07:00:00 so that the bar chart we plot later remains legible.

gps_car_filter <- gps_car %>%
  filter(Timestamp < "2014-01-06 07:00:00")

gps_car_filter
# A tibble: 379 x 9
   Timestamp           CarID   lat  long name       LastName FirstName
   <dttm>              <fct> <dbl> <dbl> <chr>      <chr>    <chr>    
 1 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 2 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 3 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 4 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 5 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 6 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 7 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 8 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
 9 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
10 2014-01-06 06:28:00 35     36.1  24.9 Vasco-Pai~ Vasco-P~ Willem   
# ... with 369 more rows, and 2 more variables:
#   CurrentEmploymentType <chr>, CurrentEmploymentTitle <chr>

First, we will group the data by Timestamp and CarID.

gps_car_group <- gps_car_filter %>%
  group_by(Timestamp, CarID) %>%
  summarize(count = n()) %>%
  ungroup() 

Based on the bar chart below, there are 4 CarIDs recorded before 2014-01-06 07:00:00. They are 4, 10, 19 and 35, with CarIDs 4 and 35 showing the most movement.

gps_car_group %>%
  plot_ly(x = ~Timestamp, y = ~count, color = ~CarID, hoverinfo = "text",
          text = ~paste("CarID:", CarID, "<br>","Timestamp:", Timestamp, "<br>", "Count:", count)) %>%
  add_bars %>%
  layout(xaxis = list(title  = ""))

Next, we will filter for the period from 07:00:00 to 07:12:00.

gps_car_filter2 <- gps_car %>%
  filter(Timestamp > "2014-01-06 07:00:00" & Timestamp <= "2014-01-06 07:12:00")

gps_car_filter2
# A tibble: 1,407 x 9
   Timestamp           CarID   lat  long name      LastName  FirstName
   <dttm>              <fct> <dbl> <dbl> <chr>     <chr>     <chr>    
 1 2014-01-06 07:01:00 7      36.1  24.9 Orilla, ~ Orilla    Elsa     
 2 2014-01-06 07:01:00 10     36.1  24.9 Campo-Co~ Campo-Co~ Ada      
 3 2014-01-06 07:01:00 7      36.1  24.9 Orilla, ~ Orilla    Elsa     
 4 2014-01-06 07:01:00 7      36.1  24.9 Orilla, ~ Orilla    Elsa     
 5 2014-01-06 07:01:00 7      36.1  24.9 Orilla, ~ Orilla    Elsa     
 6 2014-01-06 07:01:00 7      36.1  24.9 Orilla, ~ Orilla    Elsa     
 7 2014-01-06 07:01:00 7      36.1  24.9 Orilla, ~ Orilla    Elsa     
 8 2014-01-06 07:01:00 7      36.1  24.9 Orilla, ~ Orilla    Elsa     
 9 2014-01-06 07:01:00 7      36.1  24.9 Orilla, ~ Orilla    Elsa     
10 2014-01-06 07:01:00 7      36.1  24.9 Orilla, ~ Orilla    Elsa     
# ... with 1,397 more rows, and 2 more variables:
#   CurrentEmploymentType <chr>, CurrentEmploymentTitle <chr>
gps_car_group2 <- gps_car_filter2 %>%
  group_by(Timestamp, CarID) %>%
  summarize(count = n()) %>%
  ungroup() 

From 07:05:00 to 07:12:00, we can see that CarID 35 has been moving about a lot; together with its earlier recorded movement, CarID 35, which belongs to a GASTech executive, moves the most. It is followed by CarIDs 7 and 10, which show a high amount of movement per minute but travelled for only around 5 minutes.

gps_car_group2 %>%
  plot_ly(x = ~Timestamp, y = ~count, color = ~CarID, hoverinfo = "text",
          text = ~paste("CarID:", CarID, "<br>","Timestamp:", Timestamp, "<br>", "Count:", count)) %>%
  add_bars %>%
  layout(xaxis = list(title  = ""))

Next, we will filter for the period from 07:12:00 to 07:28:00.

gps_car_filter3 <- gps_car %>%
  filter(Timestamp > "2014-01-06 07:12:00" & Timestamp < "2014-01-06 07:28:00")

gps_car_filter3
# A tibble: 1,699 x 9
   Timestamp           CarID   lat  long name       LastName FirstName
   <dttm>              <fct> <dbl> <dbl> <chr>      <chr>    <chr>    
 1 2014-01-06 07:13:00 18     36.1  24.9 Frente, B~ Frente   Birgitta 
 2 2014-01-06 07:13:00 35     36.0  24.9 Vasco-Pai~ Vasco-P~ Willem   
 3 2014-01-06 07:13:00 20     36.1  24.9 Fusil, St~ Fusil    Stenig   
 4 2014-01-06 07:13:00 35     36.0  24.9 Vasco-Pai~ Vasco-P~ Willem   
 5 2014-01-06 07:13:00 18     36.1  24.9 Frente, B~ Frente   Birgitta 
 6 2014-01-06 07:13:00 20     36.1  24.9 Fusil, St~ Fusil    Stenig   
 7 2014-01-06 07:13:00 35     36.0  24.9 Vasco-Pai~ Vasco-P~ Willem   
 8 2014-01-06 07:13:00 18     36.1  24.9 Frente, B~ Frente   Birgitta 
 9 2014-01-06 07:13:00 20     36.1  24.9 Fusil, St~ Fusil    Stenig   
10 2014-01-06 07:13:00 35     36.0  24.9 Vasco-Pai~ Vasco-P~ Willem   
# ... with 1,689 more rows, and 2 more variables:
#   CurrentEmploymentType <chr>, CurrentEmploymentTitle <chr>
gps_car_group3 <- gps_car_filter3 %>%
  group_by(Timestamp, CarID) %>%
  summarize(count = n()) %>%
  ungroup() 

As before, CarID 35 is still moving about until 07:27:00, with the highest movement count per minute.

gps_car_group3 %>%
  plot_ly(x = ~Timestamp, y = ~count, color = ~as.factor(CarID), hoverinfo = "text",
          text = ~paste("CarID:", CarID, "<br>","Timestamp:", Timestamp, "<br>", "Count:", count)) %>%
  add_bars %>%
  layout(xaxis = list(title  = ""))

Next, we will examine the last 6 minutes of the GPS dataset.

gps_car_filter4 <- gps_car %>%
  filter(Timestamp > "2014-01-19 20:51:00")
gps_car_filter4
# A tibble: 21 x 9
   Timestamp           CarID   lat  long name       LastName FirstName
   <dttm>              <fct> <dbl> <dbl> <chr>      <chr>    <chr>    
 1 2014-01-19 20:56:00 30     36.1  24.9 Resumir, ~ Resumir  Felix    
 2 2014-01-19 20:56:00 30     36.1  24.9 Resumir, ~ Resumir  Felix    
 3 2014-01-19 20:56:00 30     36.1  24.9 Resumir, ~ Resumir  Felix    
 4 2014-01-19 20:56:00 30     36.1  24.9 Resumir, ~ Resumir  Felix    
 5 2014-01-19 20:56:00 30     36.1  24.9 Resumir, ~ Resumir  Felix    
 6 2014-01-19 20:56:00 30     36.1  24.9 Resumir, ~ Resumir  Felix    
 7 2014-01-19 20:56:00 30     36.1  24.9 Resumir, ~ Resumir  Felix    
 8 2014-01-19 20:56:00 30     36.1  24.9 Resumir, ~ Resumir  Felix    
 9 2014-01-19 20:56:00 30     36.1  24.9 Resumir, ~ Resumir  Felix    
10 2014-01-19 20:56:00 30     36.1  24.9 Resumir, ~ Resumir  Felix    
# ... with 11 more rows, and 2 more variables:
#   CurrentEmploymentType <chr>, CurrentEmploymentTitle <chr>

We can see that CarID 30 is the last one recorded before the data ceases. CarID 30 belongs to the GASTech security manager.

gps_car_group4 <- gps_car_filter4 %>%
  group_by(Timestamp, CarID) %>%
  summarize(count = n()) %>%
  ungroup() 

gps_car_group4
# A tibble: 1 x 3
  Timestamp           CarID count
  <dttm>              <fct> <int>
1 2014-01-19 20:56:00 30       21

Next, we will inner join the cc_loyalty_join dataset with the gps_car dataset by the timestamp.

cc_gps <- cc_loyalty_join %>%
  inner_join(gps_car, by = c("timestamp" = "Timestamp"))

cc_gps
# A tibble: 123,687 x 16
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:28:00 Brew've Been~  11.3 4795       Mon   2014-01-06
 2 2014-01-06 07:28:00 Brew've Been~  11.3 4795       Mon   2014-01-06
 3 2014-01-06 07:28:00 Brew've Been~  11.3 4795       Mon   2014-01-06
 4 2014-01-06 07:28:00 Brew've Been~  11.3 4795       Mon   2014-01-06
 5 2014-01-06 07:28:00 Brew've Been~  11.3 4795       Mon   2014-01-06
 6 2014-01-06 07:28:00 Brew've Been~  11.3 4795       Mon   2014-01-06
 7 2014-01-06 07:28:00 Brew've Been~  11.3 4795       Mon   2014-01-06
 8 2014-01-06 07:28:00 Brew've Been~  11.3 4795       Mon   2014-01-06
 9 2014-01-06 07:28:00 Brew've Been~  11.3 4795       Mon   2014-01-06
10 2014-01-06 07:28:00 Brew've Been~  11.3 4795       Mon   2014-01-06
# ... with 123,677 more rows, and 10 more variables: time <chr>,
#   loyaltynum <fct>, CarID <fct>, lat <dbl>, long <dbl>, name <chr>,
#   LastName <chr>, FirstName <chr>, CurrentEmploymentType <chr>,
#   CurrentEmploymentTitle <chr>

Notice that when we join both datasets, several discrepancies surface. At each moment, the last4ccnum and loyaltynum are the same, yet the CarID differs. The reason is that we simply joined the vehicle and purchase datasets by timestamp. This is wrong, as at the same moment one person could be driving around without purchasing anything while another is at a shop making a purchase.

To resolve this discrepancy, we will need to make use of the map data to find the actual places based on the lat-long data provided. Thus, we proceed to Question 3.
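
One possible way to tighten the match (a sketch, not the method used in this post) is to accept a cc-to-car pairing only when the car was physically near the shop at the transaction time. Here location_coords is a hypothetical lookup table of shop coordinates that would have to be digitised from the georeferenced map, and the 50 m threshold is an arbitrary choice:

```r
library(dplyr)
library(geosphere)

# hypothetical lookup table, one row per shop:
# location_coords <- tibble(location = ..., shop_lat = ..., shop_long = ...)

cc_gps_near <- cc_loyalty_join %>%
  inner_join(gps_car, by = c("timestamp" = "Timestamp")) %>%
  inner_join(location_coords, by = "location") %>%
  mutate(dist_m = distHaversine(cbind(long, lat),
                                cbind(shop_long, shop_lat))) %>%
  filter(dist_m < 50)  # keep only cars within ~50 m of the shop
```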

4.3. Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.

To infer the owners of the cc and loyalty cards via the CarIDs, we will need to use the movement data together with the geospatial data provided. We will first take a look at the tourist map in JPG form. Notice that some locations present in the cc and loyalty datasets cannot be located on the tourist map; as such, there are some discrepancies in the map given.

Next, the JPG file provided does not have any movement data projected onto it. As such, we will need an additional tool to overlay the movement data onto the map: the Georeferencer tool in QGIS. The end product is shown below.

Next, we will read the georeferenced Tagged Image File Format (TIFF) file that we have created into RStudio via the raster function. raster is a package to read, write and manipulate spatial data in R.

map <- raster("data/MC2/MC2-tourist_modified.tif")

map
class      : RasterLayer 
band       : 1  (of  3  bands)
dimensions : 1569, 2723, 4272387  (nrow, ncol, ncell)
resolution : 3.194864e-05, 3.194864e-05  (x, y)
extent     : 24.82361, 24.9106, 36.04522, 36.09534  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs 
source     : MC2-tourist_modified.tif 
names      : MC2.tourist_modified 
values     : 0, 255  (min, max)

Next, we will plot the raster layer using tm_shape and tm_rgb from the tmap package.

tm_shape(map) +
  tm_rgb(map, r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255)

Next, we will import the GIS data layer in the ESRI shapefile format.

Abila_st <- st_read(dsn = "data/MC2/Geospatial",
                    layer = "Abila")
Reading layer `Abila' from data source 
  `C:\ongcheehong\DataViz\_posts\2021-07-04-vast-challenge-2021-assignment\data\MC2\Geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84

Next, we will create a day factor, e.g. 2014-01-06 will be day 6. This will allow us to filter by day more easily when we analyse the car movements later.

gps_cleaned$day <- as.factor(get_day(gps_cleaned$Timestamp))

gps_cleaned
# A tibble: 685,169 x 5
   Timestamp           CarID   lat  long day  
   <dttm>              <fct> <dbl> <dbl> <fct>
 1 2014-01-06 06:28:00 35     36.1  24.9 6    
 2 2014-01-06 06:28:00 35     36.1  24.9 6    
 3 2014-01-06 06:28:00 35     36.1  24.9 6    
 4 2014-01-06 06:28:00 35     36.1  24.9 6    
 5 2014-01-06 06:28:00 35     36.1  24.9 6    
 6 2014-01-06 06:28:00 35     36.1  24.9 6    
 7 2014-01-06 06:28:00 35     36.1  24.9 6    
 8 2014-01-06 06:28:00 35     36.1  24.9 6    
 9 2014-01-06 06:28:00 35     36.1  24.9 6    
10 2014-01-06 06:28:00 35     36.1  24.9 6    
# ... with 685,159 more rows

Next, we will convert the aspatial data into a simple feature data frame.

gps_sf <- st_as_sf(gps_cleaned,
  coords = c("long", "lat"),
  crs= 4326)

gps_sf
Simple feature collection with 685169 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS:  WGS 84
# A tibble: 685,169 x 4
   Timestamp           CarID day              geometry
 * <dttm>              <fct> <fct>         <POINT [°]>
 1 2014-01-06 06:28:00 35    6     (24.87469 36.07623)
 2 2014-01-06 06:28:00 35    6      (24.8746 36.07622)
 3 2014-01-06 06:28:00 35    6     (24.87444 36.07621)
 4 2014-01-06 06:28:00 35    6     (24.87425 36.07622)
 5 2014-01-06 06:28:00 35    6     (24.87417 36.07621)
 6 2014-01-06 06:28:00 35    6     (24.87406 36.07619)
 7 2014-01-06 06:28:00 35    6     (24.87391 36.07619)
 8 2014-01-06 06:28:00 35    6     (24.87381 36.07618)
 9 2014-01-06 06:28:00 35    6     (24.87374 36.07617)
10 2014-01-06 06:28:00 35    6     (24.87362 36.07618)
# ... with 685,159 more rows

Next, we will create the movement paths by grouping the GPS points by CarID and day and casting each group into a LINESTRING.

gps_path <- gps_sf %>%
  group_by(CarID, day) %>%
  summarize(m_Timestamp = mean(Timestamp),
            do_union = FALSE) %>%
  st_cast("LINESTRING")

gps_path
Simple feature collection with 508 features and 3 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS:  WGS 84
# A tibble: 508 x 4
# Groups:   CarID [40]
   CarID day   m_Timestamp                                    geometry
   <fct> <fct> <dttm>                                 <LINESTRING [°]>
 1 1     6     2014-01-06 15:01:40 (24.88258 36.06646, 24.88259 36.06~
 2 1     7     2014-01-07 12:40:38 (24.87957 36.04803, 24.87957 36.04~
 3 1     8     2014-01-08 14:34:56 (24.88265 36.06643, 24.88266 36.06~
 4 1     9     2014-01-09 12:04:17 (24.88261 36.06646, 24.88257 36.06~
 5 1     10    2014-01-10 16:04:29 (24.88265 36.0665, 24.88261 36.066~
 6 1     11    2014-01-11 16:18:03 (24.88258 36.06651, 24.88246 36.06~
 7 1     12    2014-01-12 13:30:36 (24.88259 36.06643, 24.8824 36.066~
 8 1     13    2014-01-13 13:45:46 (24.88265 36.06642, 24.8826 36.066~
 9 1     14    2014-01-14 14:03:54 (24.88261 36.06644, 24.88262 36.06~
10 1     15    2014-01-15 15:33:25 (24.88263 36.06647, 24.88257 36.06~
# ... with 498 more rows

Before we plot the GPS paths, we will need to remove the orphan paths, i.e. those consisting of a single point (p = 1), which cannot form a line.

p <- npts(gps_path, by_feature = TRUE)
gps_path2 <- cbind(gps_path, p)

gps_path_revised <- gps_path2 %>%
  filter(p != 1)

gps_path_revised
Simple feature collection with 507 features and 4 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS:  WGS 84
First 10 features:
   CarID day         m_Timestamp    p                       geometry
1      1   6 2014-01-06 15:01:40 2263 LINESTRING (24.88258 36.066...
2      1   7 2014-01-07 12:40:38 1702 LINESTRING (24.87957 36.048...
3      1   8 2014-01-08 14:34:56 1988 LINESTRING (24.88265 36.066...
4      1   9 2014-01-09 12:04:17 1247 LINESTRING (24.88261 36.066...
5      1  10 2014-01-10 16:04:29 1500 LINESTRING (24.88265 36.066...
6      1  11 2014-01-11 16:18:03 1070 LINESTRING (24.88258 36.066...
7      1  12 2014-01-12 13:30:36  590 LINESTRING (24.88259 36.066...
8      1  13 2014-01-13 13:45:46 1832 LINESTRING (24.88265 36.066...
9      1  14 2014-01-14 14:03:54 1710 LINESTRING (24.88261 36.066...
10     1  15 2014-01-15 15:33:25 1403 LINESTRING (24.88263 36.066...

With the data prepared, we are now ready to plot the GPS paths onto the map. First, we will filter the data by day == 10 to extract all vehicle movements on day 10.

Next, we will use tmap to produce the map, after which tm_shape() and tm_lines() are used to plot the movement lines onto it. Lastly, we will add layout elements such as labels and a legend to differentiate the CarIDs. The resulting geospatial map is interactive: hovering the mouse over a line brings up a tooltip showing its CarID.

Note that there is currently no interactive button or dropdown list for users to select the day or CarID they want; a Shiny app would be required for that, but it is beyond the scope of this assignment.

gps_path_selected <- gps_path_revised %>%
  filter(day == 10)
gps_path_selected
Simple feature collection with 36 features and 4 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82588 ymin: 36.04802 xmax: 24.90848 ymax: 36.08996
Geodetic CRS:  WGS 84
First 10 features:
   CarID day         m_Timestamp    p                       geometry
1      1  10 2014-01-10 16:04:29 1500 LINESTRING (24.88265 36.066...
2      2  10 2014-01-10 12:05:20 2239 LINESTRING (24.86044 36.085...
3      3  10 2014-01-10 11:47:38 1885 LINESTRING (24.85757 36.086...
4      4  10 2014-01-10 12:53:58 1708 LINESTRING (24.87213 36.078...
5      5  10 2014-01-10 15:07:50 2198 LINESTRING (24.87793 36.067...
6      6  10 2014-01-10 16:03:54 2358 LINESTRING (24.89485 36.059...
7      7  10 2014-01-10 11:50:46 1883 LINESTRING (24.86422 36.084...
8      8  10 2014-01-10 16:52:09 1855 LINESTRING (24.88597 36.067...
9      9  10 2014-01-10 13:07:23 1218 LINESTRING (24.85646 36.084...
10    10  10 2014-01-10 12:42:31 2116 LINESTRING (24.86586 36.076...
tmap_mode("view")
m <- tm_shape(map) +
  tm_rgb(r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_path_selected) +
  tm_lines(col = "CarID", palette = "-RdYlBu") +
  tm_text(text = "CarID") +
  tm_layout(title = "Map of car movements")

lf <- tmap_leaflet(m)

lf2 <- lf %>%
  addTiles(group = "CarID") %>%
  addLayersControl(baseGroups = c("CarID"))

lf

Next, we will show how to infer the CarID behind each record in the cc dataset. The process is as follows:

First, we will look at the cc_loyalty_join dataset, which is the joined cc and loyalty dataset, and filter it to an arbitrary date such as 2014-01-10, as shown below. Taking the location Ouzeri Elian as a reference, we can see that there are 6 transactions there on day 10.

cc_loyalty_join %>%
  filter(date == "2014-01-10") %>%
  arrange(location) 
# A tibble: 84 x 8
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-10 13:36:00 Abila Zacha~  36.7  1310       Fri   2014-01-10
 2 2014-01-10 13:46:00 Abila Zacha~  31.3  2681       Fri   2014-01-10
 3 2014-01-10 13:59:00 Abila Zacha~   8.22 1874       Fri   2014-01-10
 4 2014-01-10 14:01:00 Abila Zacha~  16.1  3492       Fri   2014-01-10
 5 2014-01-10 14:03:00 Abila Zacha~  36.4  7819       Fri   2014-01-10
 6 2014-01-10 14:09:00 Abila Zacha~   9.32 6691       Fri   2014-01-10
 7 2014-01-10 20:11:00 Albert's Fi~ 126.   6895       Fri   2014-01-10
 8 2014-01-10 12:00:00 Bean There ~  15.4  1415       Fri   2014-01-10
 9 2014-01-10 12:00:00 Bean There ~   3.92 9635       Fri   2014-01-10
10 2014-01-10 12:00:00 Bean There ~  13.4  1877       Fri   2014-01-10
# ... with 74 more rows, and 2 more variables: time <chr>,
#   loyaltynum <fct>

Now, we will head back to the map and record all the CarIDs that were present at Ouzeri Elian. Notice that there are 6 CarIDs: 1, 5, 9, 15, 18 and 21. Two caveats: 1) the number of CarIDs will not always match the number of transactions in the cc_loyalty_join dataset, but the method is the same, i.e. record all the CarIDs present; 2) the lines do not always stop exactly at the location itself, as is the case at Ouzeri Elian. A useful tip is to look for a "U-turn" in a path, which indicates that the driver stopped at the place and left afterwards.
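To make this check easier, the candidate paths can also be isolated and plotted on their own. The following is a minimal sketch reusing the objects created above; candidate_ids and gps_path_candidates are hypothetical names introduced here for illustration.

```r
# Sketch: plot only the six candidate CarIDs on day 10 so that
# stops and U-turns near Ouzeri Elian are easier to spot.
candidate_ids <- c(1, 5, 9, 15, 18, 21)   # CarIDs recorded at Ouzeri Elian

gps_path_candidates <- gps_path_revised %>%
  filter(CarID %in% candidate_ids, day == 10)

tmap_mode("view")
tm_shape(map) +
  tm_rgb(r = 1, g = 2, b = 3, max.value = 255) +
  tm_shape(gps_path_candidates) +
  tm_lines(col = "CarID")
```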

Next, we will filter the cc_loyalty_join dataset by each of the 6 last4ccnum values seen at Ouzeri Elian. For our example, we will use 1877. Notice that last4ccnum 1877 also visited Bean There Done That.

cc_loyalty_join %>%
  filter(last4ccnum== 1877, date == "2014-01-10") %>%
  arrange(location) 
# A tibble: 2 x 8
  timestamp           location price last4ccnum day   date       time 
  <dttm>              <chr>    <dbl> <fct>      <ord> <date>     <chr>
1 2014-01-10 12:00:00 Bean Th~  13.4 1877       Fri   2014-01-10 12:00
2 2014-01-10 13:50:00 Ouzeri ~  15.2 1877       Fri   2014-01-10 13:50
# ... with 1 more variable: loyaltynum <fct>

Now, we will head back to the map and hover over Bean There Done That; a tooltip for CarID 9 pops up. None of the other candidate CarIDs from Ouzeri Elian appear at this second location. As such, CarID 9 corresponds to last4ccnum 1877 and loyaltynum L3014.

This cross-examination of the two data sources does not always work, owing to discrepancies and limitations in the map data: 1) the GPS location does not always align exactly with the stated location, as with Ouzeri Elian; 2) if a cardholder visited only one place on a given day and many CarIDs were present at that location, it is difficult to pinpoint the right CarID; and 3) some places, such as Hippokampos, are not indicated on the map.

Most of these limitations and discrepancies can be resolved by trial and error on other days. Another tip is to create a second map filtered by day and CarID, to see a single car's movement clearly and check whether it matches the cc dataset. Even so, some CarIDs cannot be inferred, due to a lack of good information or to false information in the data.

gps_path_selected <- gps_path_revised %>%
  filter(CarID == 9, day == 10)

tmap_mode("view")
tm_shape(map) +
  tm_rgb(r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_path_selected) +
  tm_lines()

Tagging CarID to cc and loyalty dataset

Below is a dataframe that has been created to tag the CarID to the cc and loyalty datasets.

Notice that some users have multiple credit cards.

The CarIDs that could not be inferred are 20, 23, 24, 28, 32, 102 and 103. Note that 102 and 103 are not tagged to any cc data because no movement data was provided for these two vehicles.

car_cc <- data.frame("CarID" = c(1,1,1,2,3,4,4,5,5,6,7,8,8,9,10,10,11,12,13,13,13,14,14,14,15,15,16,16,17,18,18,18,18,19,21,22,22,
                                 25,25,25,26,26,27,29,29,29,30,30,30,30,31,33,34,35,101,101,104,105,106,106,107),"last4ccnum" = c(7108,9551,2681,1415,9635,8156,7688,6899,7117,7253,2540,5368,5368,1877,8332,8332,1321,7108,5407,8202,6691,7889,7889,1874,3853,8129,4795,4795,
  7384,9617,7354,4434,7792,6895,9405,1286,1286,6816,2418,2142,7819,1310,3492, 3547,5921,5921,8411,4948,4948,6901,5010,9683,3484,2463,9220,9220,8642,9152,2276,3506,4530), "loyaltynum" = c("L6544","L5777","L1107","L7783","L3191","L5224","L4164", "L6267","L6417","L1682","L5947", "L2247","L6119","L3014","L2070","L8566","L4149","L6544","L4034","L2343","L6267","L6119","L2247","L4424","L1485","L8328","L8566","L2070","L3800","L5553","L9254","L2169","L5756","L3366","L3259","L3288","L3572","L8148","L9018","L9637","L5259","L8012","L7814","L9362","L3295","L9406","L6110","L9406","L3295","L9363","L2459","L7291","L2490","L6886","L4063","L7761","L2769","L5485","L3317","L7761","L8477"))


car_cc$last4ccnum <- as.factor(car_cc$last4ccnum)
car_cc$loyaltynum <- as.factor(car_cc$loyaltynum)

car_cc
   CarID last4ccnum loyaltynum
1      1       7108      L6544
2      1       9551      L5777
3      1       2681      L1107
4      2       1415      L7783
5      3       9635      L3191
6      4       8156      L5224
7      4       7688      L4164
8      5       6899      L6267
9      5       7117      L6417
10     6       7253      L1682
11     7       2540      L5947
12     8       5368      L2247
13     8       5368      L6119
14     9       1877      L3014
15    10       8332      L2070
16    10       8332      L8566
17    11       1321      L4149
18    12       7108      L6544
19    13       5407      L4034
20    13       8202      L2343
21    13       6691      L6267
22    14       7889      L6119
23    14       7889      L2247
24    14       1874      L4424
25    15       3853      L1485
26    15       8129      L8328
27    16       4795      L8566
28    16       4795      L2070
29    17       7384      L3800
30    18       9617      L5553
31    18       7354      L9254
32    18       4434      L2169
33    18       7792      L5756
34    19       6895      L3366
35    21       9405      L3259
36    22       1286      L3288
37    22       1286      L3572
38    25       6816      L8148
39    25       2418      L9018
40    25       2142      L9637
41    26       7819      L5259
42    26       1310      L8012
43    27       3492      L7814
44    29       3547      L9362
45    29       5921      L3295
46    29       5921      L9406
47    30       8411      L6110
48    30       4948      L9406
49    30       4948      L3295
50    30       6901      L9363
51    31       5010      L2459
52    33       9683      L7291
53    34       3484      L2490
54    35       2463      L6886
55   101       9220      L4063
56   101       9220      L7761
57   104       8642      L2769
58   105       9152      L5485
59   106       2276      L3317
60   106       3506      L7761
61   107       4530      L8477

4.4. Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.

Network analysis is used to detect relationships between persons of interest.

To start the network analysis, we will first full-join the cc_loyalty_join dataset to car_cc. A full join keeps unmatched rows from both sides, filling the missing columns with NA. In the tibble below, note that some rows have an NA CarID; these are likely truck drivers, for whom the car_cc dataset has no entry.

car_cc_join <- cc_loyalty_join %>%
  full_join(car_cc, by = c("last4ccnum" , "loyaltynum"))

car_cc_join
# A tibble: 1,104 x 9
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:28:00 Brew've Been~ 11.3  4795       Mon   2014-01-06
 2 2014-01-06 07:35:00 Brew've Been~  8.33 6816       Mon   2014-01-06
 3 2014-01-06 07:36:00 Hallowed Gro~ 16.7  9617       Mon   2014-01-06
 4 2014-01-06 07:37:00 Brew've Been~  4.24 7384       Mon   2014-01-06
 5 2014-01-06 07:38:00 Brew've Been~  4.17 5368       Mon   2014-01-06
 6 2014-01-06 07:43:00 Brew've Been~  9.6  4948       Mon   2014-01-06
 7 2014-01-06 07:43:00 Brew've Been~ 16.9  9683       Mon   2014-01-06
 8 2014-01-06 07:47:00 Hallowed Gro~ 16.5  8129       Mon   2014-01-06
 9 2014-01-06 07:48:00 Hallowed Gro~ 10.7  3492       Mon   2014-01-06
10 2014-01-06 07:49:00 Coffee Camel~  8.39 5921       Mon   2014-01-06
# ... with 1,094 more rows, and 3 more variables: time <chr>,
#   loyaltynum <fct>, CarID <dbl>

We will take a look at the rows with NA CarID values. There are 31 rows in total for which the CarID cannot be inferred.

car_cc_na <- cc_loyalty_join %>%
  full_join(car_cc, by = c("last4ccnum" , "loyaltynum")) %>%
  filter(is.na(CarID))

car_cc_na
# A tibble: 31 x 9
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 10:02:00 Nationwide ~  762.  9735       Mon   2014-01-06
 2 2014-01-06 11:33:00 Stewart and~ 2145.  9735       Mon   2014-01-06
 3 2014-01-06 12:00:00 Jack's Magi~   18.6 9241       Mon   2014-01-06
 4 2014-01-07 10:11:00 Nationwide ~ 1347.  9735       Tue   2014-01-07
 5 2014-01-07 12:00:00 Jack's Magi~   18.8 9241       Tue   2014-01-07
 6 2014-01-07 20:15:00 Ouzeri Elian   11.4 9241       Tue   2014-01-07
 7 2014-01-08 10:14:00 Nationwide ~ 3674.  9735       Wed   2014-01-08
 8 2014-01-08 11:58:00 Stewart and~ 1674.  9735       Wed   2014-01-08
 9 2014-01-08 12:00:00 Jack's Magi~   18.8 9241       Wed   2014-01-08
10 2014-01-09 10:16:00 Nationwide ~  674.  9735       Thu   2014-01-09
# ... with 21 more rows, and 3 more variables: time <chr>,
#   loyaltynum <fct>, CarID <dbl>

We will drop the NA rows that cannot be inferred, leaving a total of 1,073 rows in our joined dataset.

car_cc_dropna <- car_cc_join %>%
  drop_na(CarID) 

car_cc_dropna$CarID = as.factor(car_cc_dropna$CarID)

car_cc_dropna 
# A tibble: 1,073 x 9
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:28:00 Brew've Been~ 11.3  4795       Mon   2014-01-06
 2 2014-01-06 07:35:00 Brew've Been~  8.33 6816       Mon   2014-01-06
 3 2014-01-06 07:36:00 Hallowed Gro~ 16.7  9617       Mon   2014-01-06
 4 2014-01-06 07:37:00 Brew've Been~  4.24 7384       Mon   2014-01-06
 5 2014-01-06 07:38:00 Brew've Been~  4.17 5368       Mon   2014-01-06
 6 2014-01-06 07:43:00 Brew've Been~  9.6  4948       Mon   2014-01-06
 7 2014-01-06 07:43:00 Brew've Been~ 16.9  9683       Mon   2014-01-06
 8 2014-01-06 07:47:00 Hallowed Gro~ 16.5  8129       Mon   2014-01-06
 9 2014-01-06 07:48:00 Hallowed Gro~ 10.7  3492       Mon   2014-01-06
10 2014-01-06 07:49:00 Coffee Camel~  8.39 5921       Mon   2014-01-06
# ... with 1,063 more rows, and 3 more variables: time <chr>,
#   loyaltynum <fct>, CarID <fct>

Next, we will select the columns from the car assignment dataset that we require for our network analysis.

car_unite_select <- car_unite %>%
  select(name, CarID, CurrentEmploymentType, CurrentEmploymentTitle)

car_unite_select
# A tibble: 44 x 4
   name              CarID CurrentEmploymentType CurrentEmploymentTit~
   <chr>             <fct> <chr>                 <chr>                
 1 Calixto, Nils     1     Information Technolo~ IT Helpdesk          
 2 Azada, Lars       2     Engineering           Engineer             
 3 Balas, Felix      3     Engineering           Engineer             
 4 Barranco, Ingrid  4     Executive             SVP/CFO              
 5 Baza, Isak        5     Information Technolo~ IT Technician        
 6 Bergen, Linnea    6     Information Technolo~ IT Group Manager     
 7 Orilla, Elsa      7     Engineering           Drill Technician     
 8 Alcazar, Lucas    8     Information Technolo~ IT Technician        
 9 Cazar, Gustav     9     Engineering           Drill Technician     
10 Campo-Corrente, ~ 10    Executive             SVP/CIO              
# ... with 34 more rows

We will next inner-join the car_cc dataset with the car assignment dataset. Notice the reduction in the number of rows; the rows taken out belong to trucks, which do not have a CarID.

carassgn_car_cc_join <- car_cc_dropna %>%
  inner_join(car_unite_select, by = c("CarID"))

carassgn_car_cc_join
# A tibble: 1,023 x 12
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:28:00 Brew've Been~ 11.3  4795       Mon   2014-01-06
 2 2014-01-06 07:35:00 Brew've Been~  8.33 6816       Mon   2014-01-06
 3 2014-01-06 07:36:00 Hallowed Gro~ 16.7  9617       Mon   2014-01-06
 4 2014-01-06 07:37:00 Brew've Been~  4.24 7384       Mon   2014-01-06
 5 2014-01-06 07:38:00 Brew've Been~  4.17 5368       Mon   2014-01-06
 6 2014-01-06 07:43:00 Brew've Been~  9.6  4948       Mon   2014-01-06
 7 2014-01-06 07:43:00 Brew've Been~ 16.9  9683       Mon   2014-01-06
 8 2014-01-06 07:47:00 Hallowed Gro~ 16.5  8129       Mon   2014-01-06
 9 2014-01-06 07:48:00 Hallowed Gro~ 10.7  3492       Mon   2014-01-06
10 2014-01-06 07:49:00 Coffee Camel~  8.39 5921       Mon   2014-01-06
# ... with 1,013 more rows, and 6 more variables: time <chr>,
#   loyaltynum <fct>, CarID <fct>, name <chr>,
#   CurrentEmploymentType <chr>, CurrentEmploymentTitle <chr>

How do we find relationships between people? Typically, when friends go out for lunch together, they each pay separately on the spot, producing transactions at the same location within minutes of each other, which is indicative of a close relationship. For our study, we will therefore bin the transaction times into 5-minute intervals to sieve out near-simultaneous transactions that might indicate a close relationship.

To do that, we will cut the timestamp into 5-minute breaks as shown below. Notice that a new column, time_5, is created, holding each transaction's 5-minute bin.

ccassgn_5min <- carassgn_car_cc_join  %>%
  mutate(time_5 = cut(timestamp, breaks="5 min"))

ccassgn_5min
# A tibble: 1,023 x 13
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:28:00 Brew've Been~ 11.3  4795       Mon   2014-01-06
 2 2014-01-06 07:35:00 Brew've Been~  8.33 6816       Mon   2014-01-06
 3 2014-01-06 07:36:00 Hallowed Gro~ 16.7  9617       Mon   2014-01-06
 4 2014-01-06 07:37:00 Brew've Been~  4.24 7384       Mon   2014-01-06
 5 2014-01-06 07:38:00 Brew've Been~  4.17 5368       Mon   2014-01-06
 6 2014-01-06 07:43:00 Brew've Been~  9.6  4948       Mon   2014-01-06
 7 2014-01-06 07:43:00 Brew've Been~ 16.9  9683       Mon   2014-01-06
 8 2014-01-06 07:47:00 Hallowed Gro~ 16.5  8129       Mon   2014-01-06
 9 2014-01-06 07:48:00 Hallowed Gro~ 10.7  3492       Mon   2014-01-06
10 2014-01-06 07:49:00 Coffee Camel~  8.39 5921       Mon   2014-01-06
# ... with 1,013 more rows, and 7 more variables: time <chr>,
#   loyaltynum <fct>, CarID <fct>, name <chr>,
#   CurrentEmploymentType <chr>, CurrentEmploymentTitle <chr>,
#   time_5 <fct>

Next, we will select the columns required for our network analysis.

ccassgn_5min_select <- ccassgn_5min %>%
  select(CarID,time_5,location,name,day)

ccassgn_5min_select
# A tibble: 1,023 x 5
   CarID time_5              location            name            day  
   <fct> <fct>               <chr>               <chr>           <ord>
 1 16    2014-01-06 07:28:00 Brew've Been Served Vann, Isia      Mon  
 2 25    2014-01-06 07:33:00 Brew've Been Served Herrero, Kanon  Mon  
 3 18    2014-01-06 07:33:00 Hallowed Grounds    Frente, Birgit~ Mon  
 4 17    2014-01-06 07:33:00 Brew've Been Served Flecha, Sven    Mon  
 5 8     2014-01-06 07:38:00 Brew've Been Served Alcazar, Lucas  Mon  
 6 30    2014-01-06 07:43:00 Brew've Been Served Resumir, Felix  Mon  
 7 33    2014-01-06 07:43:00 Brew've Been Served Tempestad, Bra~ Mon  
 8 15    2014-01-06 07:43:00 Hallowed Grounds    Bodrogi, Loreto Mon  
 9 27    2014-01-06 07:48:00 Hallowed Grounds    Orilla, Kare    Mon  
10 29    2014-01-06 07:48:00 Coffee Cameleon     Ovan, Bertrand  Mon  
# ... with 1,013 more rows

Next, we will self-join the table to find pairs of people at the same location within the same 5-minute window. Notice the resulting CarID_1 and CarID_2 columns. Since a CarID should not be paired with itself, we will remove such rows in the next code chunk.

car_location_edges <- ccassgn_5min_select %>%
  inner_join(ccassgn_5min_select, by = c("time_5","location","day"), suffix = c("_1","_2"))

car_location_edges
# A tibble: 1,577 x 7
   CarID_1 time_5      location     name_1     day   CarID_2 name_2   
   <fct>   <fct>       <chr>        <chr>      <ord> <fct>   <chr>    
 1 16      2014-01-06~ Brew've Bee~ Vann, Isia Mon   16      Vann, Is~
 2 25      2014-01-06~ Brew've Bee~ Herrero, ~ Mon   25      Herrero,~
 3 25      2014-01-06~ Brew've Bee~ Herrero, ~ Mon   17      Flecha, ~
 4 18      2014-01-06~ Hallowed Gr~ Frente, B~ Mon   18      Frente, ~
 5 17      2014-01-06~ Brew've Bee~ Flecha, S~ Mon   25      Herrero,~
 6 17      2014-01-06~ Brew've Bee~ Flecha, S~ Mon   17      Flecha, ~
 7 8       2014-01-06~ Brew've Bee~ Alcazar, ~ Mon   8       Alcazar,~
 8 30      2014-01-06~ Brew've Bee~ Resumir, ~ Mon   30      Resumir,~
 9 30      2014-01-06~ Brew've Bee~ Resumir, ~ Mon   33      Tempesta~
10 33      2014-01-06~ Brew've Bee~ Tempestad~ Mon   30      Resumir,~
# ... with 1,567 more rows

Next, we will group by the two CarIDs together with the time, location and day to find the pairs of cars that went to the same place at around the same time.

car_location_edges_aggregated <- car_location_edges %>%
  group_by(CarID_1, CarID_2, time_5, location, day) %>%
  summarise(Weight = n()) %>%
  filter(CarID_1 != CarID_2) %>%
  ungroup()


car_location_edges_aggregated
# A tibble: 510 x 6
   CarID_1 CarID_2 time_5              location         day   Weight
   <fct>   <fct>   <fct>               <chr>            <ord>  <int>
 1 1       5       2014-01-16 20:23:00 Ouzeri Elian     Thu        1
 2 1       6       2014-01-07 13:33:00 Gelatogalore     Tue        1
 3 1       8       2014-01-15 13:38:00 Ouzeri Elian     Wed        1
 4 1       9       2014-01-07 13:43:00 Kalami Kafenion  Tue        1
 5 1       12      2014-01-07 07:48:00 Hallowed Grounds Tue        1
 6 1       12      2014-01-07 13:43:00 Kalami Kafenion  Tue        1
 7 1       12      2014-01-08 20:58:00 General Grocer   Wed        1
 8 1       12      2014-01-09 14:03:00 Ouzeri Elian     Thu        1
 9 1       12      2014-01-10 07:48:00 Hallowed Grounds Fri        1
10 1       12      2014-01-11 14:13:00 Kalami Kafenion  Sat        1
# ... with 500 more rows

Next, we will drop the truck rows, which have NA CarID values, from the car assignment dataset to prevent errors during the network analysis.

car_nodes_with_truck <- car_unite %>%
  select(CarID, name,CurrentEmploymentType,CurrentEmploymentTitle)

car_nodes <- car_nodes_with_truck %>%
  drop_na(CarID)

car_nodes
# A tibble: 35 x 4
   CarID name              CurrentEmploymentType CurrentEmploymentTit~
   <fct> <chr>             <chr>                 <chr>                
 1 1     Calixto, Nils     Information Technolo~ IT Helpdesk          
 2 2     Azada, Lars       Engineering           Engineer             
 3 3     Balas, Felix      Engineering           Engineer             
 4 4     Barranco, Ingrid  Executive             SVP/CFO              
 5 5     Baza, Isak        Information Technolo~ IT Technician        
 6 6     Bergen, Linnea    Information Technolo~ IT Group Manager     
 7 7     Orilla, Elsa      Engineering           Drill Technician     
 8 8     Alcazar, Lucas    Information Technolo~ IT Technician        
 9 9     Cazar, Gustav     Engineering           Drill Technician     
10 10    Campo-Corrente, ~ Executive             SVP/CIO              
# ... with 25 more rows

Next, tbl_graph() from the tidygraph package is used to build the graph object for our network analysis by specifying the nodes and edges.

car_location_graph <- tbl_graph(nodes = car_nodes,
                                edges = car_location_edges_aggregated,
                                directed = TRUE)

car_location_graph
# A tbl_graph: 35 nodes and 510 edges
#
# A directed multigraph with 6 components
#
# Node Data: 35 x 4 (active)
  CarID name             CurrentEmploymentType  CurrentEmploymentTitle
  <fct> <chr>            <chr>                  <chr>                 
1 1     Calixto, Nils    Information Technology IT Helpdesk           
2 2     Azada, Lars      Engineering            Engineer              
3 3     Balas, Felix     Engineering            Engineer              
4 4     Barranco, Ingrid Executive              SVP/CFO               
5 5     Baza, Isak       Information Technology IT Technician         
6 6     Bergen, Linnea   Information Technology IT Group Manager      
# ... with 29 more rows
#
# Edge Data: 510 x 6
   from    to time_5              location     day   Weight
  <int> <int> <fct>               <chr>        <ord>  <int>
1     1     2 2014-01-16 20:23:00 Ouzeri Elian Thu        1
2     1     3 2014-01-07 13:33:00 Gelatogalore Tue        1
3     1     4 2014-01-15 13:38:00 Ouzeri Elian Wed        1
# ... with 507 more rows

Next, we will create a facet grid by employment type to examine the relationships within each group. Notice that the Executive facet has no links between its members, indicating that their relationships are not as close as those within Engineering and Security.

set_graph_style()

g <- ggraph(car_location_graph,
            layout = "nicely") +
  geom_edge_link(aes(width = Weight),
                 alpha = 0.2) +
  scale_edge_width(range = c(0.1, 5)) +
  geom_node_point(aes(colour = CurrentEmploymentType),
                  size = 2)

g + facet_nodes(~CurrentEmploymentType) +
  th_foreground(foreground = "grey80",
                border = TRUE) +
  theme(legend.position = 'bottom')

Next, we will compute betweenness centrality for our network. Security and Engineering nodes have larger circles, indicating that they lie on many of the shortest paths between other nodes and are therefore closely connected to others.

g <- car_location_graph %>%
  mutate(betweenness_centrality = centrality_betweenness()) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(width = Weight),
                 alpha = 0.2) +
  scale_edge_width(range = c(0.1, 5)) +
  geom_node_point(aes(colour = CurrentEmploymentType,
                      size = betweenness_centrality))

g + theme_graph()

Next, we will visualise possible communities using edge-betweenness clustering. A community of size greater than 1 indicates a group of people who frequent the same places together. Notice that the communities with more than one member belong largely to the Engineering department, which shows that its members are closely linked with one another.

g <- car_location_graph %>%
  mutate(community = as.factor(group_edge_betweenness(weights = Weight,
                                                      directed = TRUE))) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(width = Weight),
                 alpha = 0.2) +
  scale_edge_width(range = c(0.1, 5)) +
  geom_node_point(aes(colour = community))

g + theme_graph()

Next, we will prepare for the interactive network analysis using visNetwork.

First, we will rename CarID to id and CurrentEmploymentType to group, as required by visNetwork.

car_nodes_rename <- car_nodes%>%
  rename(id = CarID, group = CurrentEmploymentType)

Next, we will prepare the edge data for visNetwork: we join the node table twice to translate the names into from and to ids, aggregate the pair counts into a weight, and keep only pairs that met more than once.

car_location_edges_aggregated_vis <- car_location_edges %>%
  left_join(car_nodes_rename, by = c("name_1" = "name")) %>%
  rename(from = id) %>%
  left_join(car_nodes_rename, by = c("name_2" = "name")) %>%
  rename(to = id) %>%
  group_by(from, to) %>%
  summarise(weight = n()) %>%
  filter(weight > 1) %>%
  ungroup()

Next, we will create the visNetwork graph. The visNetwork output is interactive: a CarID can be selected from the dropdown to highlight its relationships with the other CarIDs.

visNetwork(car_nodes_rename,
           car_location_edges_aggregated_vis) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visLayout(randomSeed = 123)

Next, we will repeat the steps for the cc records without a matching loyalty record (the cc-loyalty anti-join).

cc_loyalty_antijoin
# A tibble: 409 x 7
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:34:00 Hallowed Gr~ 5.22e1 7108       Mon   2014-01-06
 2 2014-01-06 07:42:00 Coffee Came~ 2.87e1 7253       Mon   2014-01-06
 3 2014-01-06 07:52:00 Brew've Bee~ 3.28e1 9405       Mon   2014-01-06
 4 2014-01-06 07:54:00 Hallowed Gr~ 1.69e1 7889       Mon   2014-01-06
 5 2014-01-06 07:54:00 Hallowed Gr~ 3.29e1 2681       Mon   2014-01-06
 6 2014-01-06 08:00:00 Hallowed Gr~ 9.69e0 6691       Mon   2014-01-06
 7 2014-01-06 08:14:00 Coffee Came~ 3.2 e0 2418       Mon   2014-01-06
 8 2014-01-06 12:00:00 Coffee Shack 5.15e1 7117       Mon   2014-01-06
 9 2014-01-06 12:00:00 Jack's Magi~ 2.78e1 8156       Mon   2014-01-06
10 2014-01-06 12:16:00 Abila Airpo~ 1.87e3 8642       Mon   2014-01-06
# ... with 399 more rows, and 1 more variable: time <chr>

We will left-join the cc_loyalty_antijoin dataset to the car_cc data.

car_cc_anti_left <- cc_loyalty_antijoin %>%
  left_join(car_cc, by = c("last4ccnum"))
  
  
car_cc_anti_left
# A tibble: 475 x 9
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:34:00 Hallowed Gro~ 52.2  7108       Mon   2014-01-06
 2 2014-01-06 07:34:00 Hallowed Gro~ 52.2  7108       Mon   2014-01-06
 3 2014-01-06 07:42:00 Coffee Camel~ 28.7  7253       Mon   2014-01-06
 4 2014-01-06 07:52:00 Brew've Been~ 32.8  9405       Mon   2014-01-06
 5 2014-01-06 07:54:00 Hallowed Gro~ 16.9  7889       Mon   2014-01-06
 6 2014-01-06 07:54:00 Hallowed Gro~ 16.9  7889       Mon   2014-01-06
 7 2014-01-06 07:54:00 Hallowed Gro~ 32.9  2681       Mon   2014-01-06
 8 2014-01-06 08:00:00 Hallowed Gro~  9.69 6691       Mon   2014-01-06
 9 2014-01-06 08:14:00 Coffee Camel~  3.2  2418       Mon   2014-01-06
10 2014-01-06 12:00:00 Coffee Shack  51.5  7117       Mon   2014-01-06
# ... with 465 more rows, and 3 more variables: time <chr>,
#   CarID <dbl>, loyaltynum <fct>

Next, we will bin the timestamps into 5-minute intervals.

cc_anti_left_5min <- car_cc_anti_left %>%
  mutate(time_5 = cut(timestamp, breaks="5 min"))

cc_anti_left_5min
# A tibble: 475 x 10
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:34:00 Hallowed Gro~ 52.2  7108       Mon   2014-01-06
 2 2014-01-06 07:34:00 Hallowed Gro~ 52.2  7108       Mon   2014-01-06
 3 2014-01-06 07:42:00 Coffee Camel~ 28.7  7253       Mon   2014-01-06
 4 2014-01-06 07:52:00 Brew've Been~ 32.8  9405       Mon   2014-01-06
 5 2014-01-06 07:54:00 Hallowed Gro~ 16.9  7889       Mon   2014-01-06
 6 2014-01-06 07:54:00 Hallowed Gro~ 16.9  7889       Mon   2014-01-06
 7 2014-01-06 07:54:00 Hallowed Gro~ 32.9  2681       Mon   2014-01-06
 8 2014-01-06 08:00:00 Hallowed Gro~  9.69 6691       Mon   2014-01-06
 9 2014-01-06 08:14:00 Coffee Camel~  3.2  2418       Mon   2014-01-06
10 2014-01-06 12:00:00 Coffee Shack  51.5  7117       Mon   2014-01-06
# ... with 465 more rows, and 4 more variables: time <chr>,
#   CarID <dbl>, loyaltynum <fct>, time_5 <fct>
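As a quick illustration of how `cut()` behaves on POSIXct timestamps (the toy timestamps below are made up, not from the dataset), values falling within the same 5-minute window receive the same bin label:

```r
# Toy timestamps (not from the dataset) to illustrate 5-minute binning.
ts <- as.POSIXct(c("2014-01-06 07:34:00",
                   "2014-01-06 07:36:00",
                   "2014-01-06 07:42:00"), tz = "UTC")

cut(ts, breaks = "5 min")
# The first two timestamps share a bin; the third falls in a later one.
```

One caveat: the bins are anchored at the earliest timestamp rather than at clock boundaries, so two transactions only a couple of minutes apart can still fall into adjacent bins and would not be matched by this approach.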

Next, we will inner join the data to the car assignment dataset to get the name for each CarID.

cc_anti_left_5min$CarID <- as.factor(cc_anti_left_5min$CarID)

car_cc_anti_left <- cc_anti_left_5min %>%
  inner_join(car_unite_select, by = c("CarID"))

car_cc_anti_left
# A tibble: 541 x 13
   timestamp           location      price last4ccnum day   date      
   <dttm>              <chr>         <dbl> <fct>      <ord> <date>    
 1 2014-01-06 07:34:00 Hallowed Gro~ 52.2  7108       Mon   2014-01-06
 2 2014-01-06 07:34:00 Hallowed Gro~ 52.2  7108       Mon   2014-01-06
 3 2014-01-06 07:42:00 Coffee Camel~ 28.7  7253       Mon   2014-01-06
 4 2014-01-06 07:52:00 Brew've Been~ 32.8  9405       Mon   2014-01-06
 5 2014-01-06 07:54:00 Hallowed Gro~ 16.9  7889       Mon   2014-01-06
 6 2014-01-06 07:54:00 Hallowed Gro~ 16.9  7889       Mon   2014-01-06
 7 2014-01-06 07:54:00 Hallowed Gro~ 32.9  2681       Mon   2014-01-06
 8 2014-01-06 08:00:00 Hallowed Gro~  9.69 6691       Mon   2014-01-06
 9 2014-01-06 08:14:00 Coffee Camel~  3.2  2418       Mon   2014-01-06
10 2014-01-06 12:00:00 Coffee Shack  51.5  7117       Mon   2014-01-06
# ... with 531 more rows, and 7 more variables: time <chr>,
#   CarID <fct>, loyaltynum <fct>, time_5 <fct>, name <chr>,
#   CurrentEmploymentType <chr>, CurrentEmploymentTitle <chr>

Select out the required columns.

cc_anti_left_5min_select <- car_cc_anti_left %>%
  select(CarID,time_5,location,name,day)

cc_anti_left_5min_select
# A tibble: 541 x 5
   CarID time_5              location            name            day  
   <fct> <fct>               <chr>               <chr>           <ord>
 1 1     2014-01-06 07:34:00 Hallowed Grounds    Calixto, Nils   Mon  
 2 12    2014-01-06 07:34:00 Hallowed Grounds    Cocinaro, Hide~ Mon  
 3 6     2014-01-06 07:39:00 Coffee Cameleon     Bergen, Linnea  Mon  
 4 21    2014-01-06 07:49:00 Brew've Been Served Osvaldo, Hennie Mon  
 5 14    2014-01-06 07:54:00 Hallowed Grounds    Dedos, Lidelse  Mon  
 6 14    2014-01-06 07:54:00 Hallowed Grounds    Dedos, Lidelse  Mon  
 7 1     2014-01-06 07:54:00 Hallowed Grounds    Calixto, Nils   Mon  
 8 13    2014-01-06 07:59:00 Hallowed Grounds    Ferro, Inga     Mon  
 9 25    2014-01-06 08:14:00 Coffee Cameleon     Herrero, Kanon  Mon  
10 5     2014-01-06 11:59:00 Coffee Shack        Baza, Isak      Mon  
# ... with 531 more rows

Join the dataset to itself to find relationships between personnel.

car_location_anti_left_edges <- cc_anti_left_5min_select %>%
  inner_join(cc_anti_left_5min_select, by = c("time_5","location","day"), suffix = c("_1","_2"))

car_location_anti_left_edges
# A tibble: 1,611 x 7
   CarID_1 time_5      location     name_1     day   CarID_2 name_2   
   <fct>   <fct>       <chr>        <chr>      <ord> <fct>   <chr>    
 1 1       2014-01-06~ Hallowed Gr~ Calixto, ~ Mon   1       Calixto,~
 2 1       2014-01-06~ Hallowed Gr~ Calixto, ~ Mon   12      Cocinaro~
 3 12      2014-01-06~ Hallowed Gr~ Cocinaro,~ Mon   1       Calixto,~
 4 12      2014-01-06~ Hallowed Gr~ Cocinaro,~ Mon   12      Cocinaro~
 5 6       2014-01-06~ Coffee Came~ Bergen, L~ Mon   6       Bergen, ~
 6 21      2014-01-06~ Brew've Bee~ Osvaldo, ~ Mon   21      Osvaldo,~
 7 14      2014-01-06~ Hallowed Gr~ Dedos, Li~ Mon   14      Dedos, L~
 8 14      2014-01-06~ Hallowed Gr~ Dedos, Li~ Mon   14      Dedos, L~
 9 14      2014-01-06~ Hallowed Gr~ Dedos, Li~ Mon   1       Calixto,~
10 14      2014-01-06~ Hallowed Gr~ Dedos, Li~ Mon   14      Dedos, L~
# ... with 1,601 more rows
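The inner self-join above generates every ordered pair of transactions sharing the same 5-minute bin and location, including self-pairs, which is why the result balloons to 1,611 rows. A toy sketch of the mechanic (made-up values, not from the dataset):

```r
library(dplyr)
library(tibble)

# Toy transactions: two cars share a bin/location, one is alone.
tx <- tibble(
  CarID    = c("1", "12", "6"),
  time_5   = c("07:34", "07:34", "07:39"),
  location = c("Hallowed Grounds", "Hallowed Grounds", "Coffee Cameleon")
)

edges <- tx %>%
  inner_join(tx, by = c("time_5", "location"), suffix = c("_1", "_2"))

# Cars 1 and 12 pair up in both directions, plus three self-pairs:
nrow(edges)  # 5
```

This is why the aggregation step that follows filters out rows where CarID_1 equals CarID_2.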

Remove the self-pairs (rows where both CarIDs are the same) and keep only pairs observed more than once.

car_location_anti_left_edges_aggregated <- car_location_anti_left_edges %>%
  group_by(CarID_1, CarID_2, time_5, location, day) %>%
  summarise(Weight = n()) %>%
  filter(CarID_1 != CarID_2, Weight > 1) %>%
  ungroup()


car_location_anti_left_edges_aggregated
# A tibble: 32 x 6
   CarID_1 CarID_2 time_5             location            day   Weight
   <fct>   <fct>   <fct>              <chr>               <ord>  <int>
 1 1       10      2014-01-19 03:44:~ "Kronos Mart"       Sun        2
 2 1       14      2014-01-06 07:54:~ "Hallowed Grounds"  Mon        2
 3 1       14      2014-01-08 07:54:~ "Hallowed Grounds"  Wed        2
 4 1       14      2014-01-09 07:49:~ "Hallowed Grounds"  Thu        2
 5 8       16      2014-01-10 07:54:~ "Brew've Been Serv~ Fri        4
 6 8       17      2014-01-10 07:54:~ "Brew've Been Serv~ Fri        2
 7 8       18      2014-01-10 07:54:~ "Brew've Been Serv~ Fri        2
 8 8       34      2014-01-08 07:54:~ "Brew've Been Serv~ Wed        2
 9 10      1       2014-01-19 03:44:~ "Kronos Mart"       Sun        2
10 13      14      2014-01-07 20:09:~ "Katerina\u0012s C~ Tue        3
# ... with 22 more rows
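Note that these edges are directed, so each pair appears twice (e.g. rows 1 and 9 above are 1-10 and 10-1). One possible refinement, not used in the analysis above, is to collapse them into undirected edges with `pmin()`/`pmax()` (converting factor CarIDs to character first):

```r
library(dplyr)
library(tibble)

# Toy directed edge list containing both directions of the same pair
# (made-up values for illustration).
edges <- tibble(
  CarID_1 = c("1", "10", "1"),
  CarID_2 = c("10", "1", "14"),
  Weight  = c(2, 2, 2)
)

undirected <- edges %>%
  mutate(a = pmin(CarID_1, CarID_2),
         b = pmax(CarID_1, CarID_2)) %>%
  distinct(a, b, Weight)

nrow(undirected)  # 2: pairs 1-10 and 1-14
```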

Examine the rows with NA CarIDs, which correspond to the truck data.

car_location_anti_left_edges %>%
  filter(is.na(CarID_1))
# A tibble: 846 x 7
   CarID_1 time_5      location       name_1   day   CarID_2 name_2   
   <fct>   <fct>       <chr>          <chr>    <ord> <fct>   <chr>    
 1 <NA>    2014-01-06~ Frydos Autosu~ Hafon, ~ Mon   <NA>    Hafon, A~
 2 <NA>    2014-01-06~ Frydos Autosu~ Hafon, ~ Mon   <NA>    Hawelon,~
 3 <NA>    2014-01-06~ Frydos Autosu~ Hafon, ~ Mon   <NA>    Hawelon,~
 4 <NA>    2014-01-06~ Frydos Autosu~ Hafon, ~ Mon   <NA>    Mies, He~
 5 <NA>    2014-01-06~ Frydos Autosu~ Hafon, ~ Mon   <NA>    Morlun, ~
 6 <NA>    2014-01-06~ Frydos Autosu~ Hafon, ~ Mon   <NA>    Morlun, ~
 7 <NA>    2014-01-06~ Frydos Autosu~ Hafon, ~ Mon   <NA>    Morlunia~
 8 <NA>    2014-01-06~ Frydos Autosu~ Hafon, ~ Mon   <NA>    Nant, Ir~
 9 <NA>    2014-01-06~ Frydos Autosu~ Hafon, ~ Mon   <NA>    Scozzese~
10 <NA>    2014-01-06~ Frydos Autosu~ Hawelon~ Mon   <NA>    Hafon, A~
# ... with 836 more rows

Prepare for visNetwork network analysis.

car_location_anti_left_edges_aggregated_vis <- car_location_anti_left_edges %>%
  left_join(car_nodes_rename, by = c("name_1" = "name")) %>%
  rename(from = id) %>%
  left_join(car_nodes_rename, by = c("name_2" = "name")) %>%
  rename(to = id) %>%
  drop_na(from, to) %>%
  group_by(from, to) %>%
  summarise(weight = n()) %>%
  filter(weight > 1) %>%
  ungroup()

Network analysis for the left anti-join of the cc data to the loyalty dataset.

visNetwork(car_nodes_rename,
           car_location_anti_left_edges_aggregated_vis) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visLayout(randomSeed = 123)

We will do the same for the right anti-join (loyalty to cc) dataset.

car_cc_anti_right <- cc_loyalty_antijoin_right %>%
  left_join(car_cc, by = c("loyaltynum"))
  
  
car_cc_anti_right
# A tibble: 375 x 6
   date       location               price loyaltynum CarID last4ccnum
   <date>     <chr>                  <dbl> <fct>      <dbl> <fct>     
 1 2014-01-06 "Coffee Shack"         11.5  L6417          5 7117      
 2 2014-01-06 "Hallowed Grounds"     12.9  L1107          1 2681      
 3 2014-01-06 "Abila Zacharo"        26.9  L9018         25 2418      
 4 2014-01-06 "Katerina\u0012s Cafi" 38.6  L6110         30 8411      
 5 2014-01-06 "Frydos Autosupply n'~ 89.3  L4034         13 5407      
 6 2014-01-07 "Hallowed Grounds"      8.42 L1107          1 2681      
 7 2014-01-07 "Coffee Cameleon"      12.4  L9018         25 2418      
 8 2014-01-07 "Ouzeri Elian"         17.7  L2247          8 5368      
 9 2014-01-07 "Ouzeri Elian"         17.7  L2247         14 7889      
10 2014-01-07 "Katerina\u0012s Cafi" 25.0  L2343         13 8202      
# ... with 365 more rows

Join the result to the car assignment dataset.

car_cc_anti_right$CarID <- as.factor(car_cc_anti_right$CarID)

car_cc_anti_right <- car_cc_anti_right %>%
  inner_join(car_unite_select, by = c("CarID"))


car_cc_anti_right
# A tibble: 380 x 9
   date       location       price loyaltynum CarID last4ccnum name   
   <date>     <chr>          <dbl> <fct>      <fct> <fct>      <chr>  
 1 2014-01-06 "Coffee Shack" 11.5  L6417      5     7117       Baza, ~
 2 2014-01-06 "Hallowed Gro~ 12.9  L1107      1     2681       Calixt~
 3 2014-01-06 "Abila Zachar~ 26.9  L9018      25    2418       Herrer~
 4 2014-01-06 "Katerina\u00~ 38.6  L6110      30    8411       Resumi~
 5 2014-01-06 "Frydos Autos~ 89.3  L4034      13    5407       Ferro,~
 6 2014-01-07 "Hallowed Gro~  8.42 L1107      1     2681       Calixt~
 7 2014-01-07 "Coffee Camel~ 12.4  L9018      25    2418       Herrer~
 8 2014-01-07 "Ouzeri Elian" 17.7  L2247      8     5368       Alcaza~
 9 2014-01-07 "Ouzeri Elian" 17.7  L2247      14    7889       Dedos,~
10 2014-01-07 "Katerina\u00~ 25.0  L2343      13    8202       Ferro,~
# ... with 370 more rows, and 2 more variables:
#   CurrentEmploymentType <chr>, CurrentEmploymentTitle <chr>

Select out the required columns.

cc_anti_right_select <- car_cc_anti_right %>%
  select(CarID,location,name,date)

cc_anti_right_select
# A tibble: 380 x 4
   CarID location                    name           date      
   <fct> <chr>                       <chr>          <date>    
 1 5     "Coffee Shack"              Baza, Isak     2014-01-06
 2 1     "Hallowed Grounds"          Calixto, Nils  2014-01-06
 3 25    "Abila Zacharo"             Herrero, Kanon 2014-01-06
 4 30    "Katerina\u0012s Cafi"      Resumir, Felix 2014-01-06
 5 13    "Frydos Autosupply n' More" Ferro, Inga    2014-01-06
 6 1     "Hallowed Grounds"          Calixto, Nils  2014-01-07
 7 25    "Coffee Cameleon"           Herrero, Kanon 2014-01-07
 8 8     "Ouzeri Elian"              Alcazar, Lucas 2014-01-07
 9 14    "Ouzeri Elian"              Dedos, Lidelse 2014-01-07
10 13    "Katerina\u0012s Cafi"      Ferro, Inga    2014-01-07
# ... with 370 more rows

Join the dataset to itself to find relationships between personnel.

car_location_anti_right_edges <- cc_anti_right_select %>%
  inner_join(cc_anti_right_select, by = c("location","date"), suffix = c("_1","_2"))

car_location_anti_right_edges
# A tibble: 1,538 x 6
   CarID_1 location           name_1     date       CarID_2 name_2    
   <fct>   <chr>              <chr>      <date>     <fct>   <chr>     
 1 5       "Coffee Shack"     Baza, Isak 2014-01-06 5       Baza, Isak
 2 1       "Hallowed Grounds" Calixto, ~ 2014-01-06 1       Calixto, ~
 3 1       "Hallowed Grounds" Calixto, ~ 2014-01-06 1       Calixto, ~
 4 1       "Hallowed Grounds" Calixto, ~ 2014-01-06 12      Cocinaro,~
 5 25      "Abila Zacharo"    Herrero, ~ 2014-01-06 25      Herrero, ~
 6 25      "Abila Zacharo"    Herrero, ~ 2014-01-06 1       Calixto, ~
 7 25      "Abila Zacharo"    Herrero, ~ 2014-01-06 12      Cocinaro,~
 8 25      "Abila Zacharo"    Herrero, ~ 2014-01-06 18      Frente, B~
 9 30      "Katerina\u0012s ~ Resumir, ~ 2014-01-06 30      Resumir, ~
10 13      "Frydos Autosuppl~ Ferro, In~ 2014-01-06 13      Ferro, In~
# ... with 1,528 more rows

Gather the CarID pairs that have a close relationship.

car_location_anti_right_edges_aggregated <- car_location_anti_right_edges %>%
  group_by(CarID_1, CarID_2, location, date) %>%
  summarise(Weight = n()) %>%
  filter(CarID_1 != CarID_2, Weight > 1) %>%
  ungroup()


car_location_anti_right_edges_aggregated
# A tibble: 70 x 5
   CarID_1 CarID_2 location         date       Weight
   <fct>   <fct>   <chr>            <date>      <int>
 1 1       4       Ouzeri Elian     2014-01-08      2
 2 1       5       Hallowed Grounds 2014-01-09      2
 3 1       12      Hallowed Grounds 2014-01-06      2
 4 1       12      Hallowed Grounds 2014-01-09      2
 5 1       12      Ouzeri Elian     2014-01-14      3
 6 1       13      Hallowed Grounds 2014-01-09      2
 7 1       14      Hallowed Grounds 2014-01-09      2
 8 1       15      Hallowed Grounds 2014-01-09      2
 9 1       21      Hallowed Grounds 2014-01-09      2
10 1       34      Ouzeri Elian     2014-01-14      3
# ... with 60 more rows

Prepare the edges for the visNetwork network analysis.

car_location_anti_right_edges_aggregated_vis <- car_location_anti_right_edges %>%
  left_join(car_nodes_rename, by = c("name_1" = "name")) %>%
  rename(from = id) %>%
  left_join(car_nodes_rename, by = c("name_2" = "name")) %>%
  rename(to = id) %>%
  drop_na(from, to) %>%
  group_by(from, to) %>%
  summarise(weight = n()) %>%
  filter(weight > 1) %>%
  ungroup()

Network analysis for the left anti-join of the loyalty data to the cc dataset.

visNetwork(car_nodes_rename,
           car_location_anti_right_edges_aggregated_vis) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visLayout(randomSeed = 123)

Network analysis for the left anti-join of the cc data to the loyalty dataset.

visNetwork(car_nodes_rename,
           car_location_anti_left_edges_aggregated_vis) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visLayout(randomSeed = 123)

Network analysis for the inner-joined dataset.

visNetwork(car_nodes_rename,
           car_location_edges_aggregated_vis) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visLayout(randomSeed = 123)

4.5. Do you see evidence of suspicious activity? Identify 1-10 locations where you believe the suspicious activity is occurring, and why. Please limit your response to 10 images and 500 words.

4.5.1 Day 11, CarID 26 visited a house at Carnero Street.

On most days, only CarIDs 2, 3 & 11 visit this house, which is presumably their residence. However, on day 11, CarID 26 went to the house.

Based on the network analysis, CarID 26 has no close relationship with CarIDs 2, 3 & 11, so it is suspicious that this employee went there.

4.5.2. Most days. Unidentified place.

On most days, trucks 101, 106 & 107 meet up at an unidentified location opposite Brew've Been Served.

4.5.3. Day 10, Unidentified place

On day 10, CarID 24 stopped at an unidentified place between Frank Fuel, Abila Scrap and Kronos Mart.

4.5.4. Day 10, House at Carnero Street.

On day 10, CarIDs 1, 2, 5, 6, 7, 8, 9, 11, 14, 18, 19, 25, 26 & 33 visited or passed by the house that is normally frequented by CarIDs 2, 3 & 11.

4.5.5. Days 13 & 14. House at Barwyn Street

Normally, this house is frequented by CarIDs 4 & 35. However, on days 13 & 14, CarID 21 went to or passed by the house.

4.5.6. Most days. Frydos Autosupply n' More.

On most days, CarIDs 13, 15, 16 & 21 pass by Frydos Autosupply n' More. However, only CarID 13 has business there.

From the network analysis, we can see that none of them have a close relationship with one another.

4.5.7. Day 18. Kronos Mart

On day 18, CarID 5 stopped at Kronos Mart but did not purchase anything.

4.5.8. Day 17. Chostus Hotel.

Three CarIDs (7, 31 & 33) visited Chostus Hotel. However, only one (33) booked the hotel.

There is no indication of a close relationship among the three.

4.5.9. Day 8, Near Max Iron and Steel.

Truck 107 stopped at an unidentified area near Max Iron and Steel.

4.5.10. Day 16. Near Kronos Mart.

Cars 21 and 24 stopped between Frank Fuel, Abila Scrap and Kronos Mart to meet.

There is no indication that the two cars have a close relationship.